Friday, November 26, 2010

Bit and Byte in Java and C++

There are a lot of bitwise operations involved in the C++ code.

signed and unsigned
C++ has an "unsigned" variant for each of the integer types, for example char, int, etc.
An unsigned type assumes all stored values are non-negative. The bit that a signed type uses to mark positive or negative is used to store additional values instead. Thus, an unsigned type holds the same number of distinct values as its signed counterpart, but its maximum value is roughly double.

For example,
char in C:
signed: -128 to 127
unsigned: 0 to 255

Java has no "unsigned" keyword, so the programmer cannot choose between a signed and an unsigned data type. The integer types byte, short, int and long are all signed. The only unsigned one is char (16 bits).

The C++ program uses unsigned char, which can store values from 0 to 255.
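To read a Java byte the way C++ reads an unsigned char, the usual trick is to mask it with 0xFF. The sketch below is only an illustration; the class and method names are my own and are not taken from the original program.

```java
// Hypothetical demo class; the names are illustrative only.
class UnsignedByteDemo {

    // Interpret a signed Java byte (-128..127) as an unsigned value (0..255).
    static int toUnsigned(byte b) {
        // The byte is widened to int with sign extension; the mask keeps the low 8 bits.
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 200;                 // does not fit in a signed byte
        System.out.println(b);               // prints -56
        System.out.println(toUnsigned(b));   // prints 200
    }
}
```

From Java 8 onwards, Byte.toUnsignedInt(b) does the same thing.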

char in C++ and Java
char in Java is 2 bytes (a UTF-16 code unit) - supports Chinese, Japanese, etc.
char in C/C++ is 1 byte - enough for ASCII characters (A-Z, 0-9, etc).

As the char in C++ is 1 byte, it gives more freedom to play with memory. The C++ code directly manipulates bytes and chars with XOR operations.
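The same kind of byte-wise XOR can be written in Java on a byte[]. This is a minimal sketch under my own assumptions: the repeating-key scheme and the names XorDemo and xorWithKey are made up for illustration, and the actual XOR routine in the C++ code may work differently.

```java
// Hypothetical demo class; the repeating-key scheme is an assumption.
class XorDemo {

    // XOR each byte of data with the key, repeating the key as needed.
    // The key must not be empty.
    static byte[] xorWithKey(byte[] data, byte[] key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            // ^ promotes both operands to int, so cast the result back to byte.
            out[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] key = { 0x20 };
        byte[] result = xorWithKey(new byte[] { 'A' }, key);
        System.out.println((char) result[0]); // prints a (0x41 ^ 0x20 = 0x61)
    }
}
```

Applying the same key twice gives back the original bytes, which is a handy sanity check when comparing the Java output with the C++ output.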

Mapping char to the byte value
Letters and digits are mapped to the same values in ASCII and Unicode. As long as only ASCII characters are used, we can be sure that the values will fall in the range 0-127, so they fit in a Java byte (-128 to 127). For example, the value of 'A' is 65 in both Unicode and ASCII.

In Java, we just need to call the .getBytes() method on the String to get a byte[] array holding the encoded chars of the String. By default it uses the platform's default charset. We should specify the character encoding to make the result consistent across platforms. For example, getBytes("US-ASCII");
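A small sketch of the conversion with the charset spelled out. It uses StandardCharsets, which only arrived in Java 7; on older versions getBytes("US-ASCII") does the same job but throws a checked UnsupportedEncodingException. The class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class AsciiBytesDemo {

    // Encode a String as ASCII bytes, independent of the platform default charset.
    static byte[] toAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // 'A' is 65 and '1' is 49 in ASCII.
        System.out.println(Arrays.toString(toAscii("A1"))); // prints [65, 49]
    }
}
```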

Endian
Endianness is the order in which a computer stores the bytes of a multi-byte value in memory. It is fairly low level, and normally we don't need to know or care how the computer arranges the bytes.

0x12345678 (a 4-byte value)
Big Endian - 12-34-56-78
Little Endian - 78-56-34-12

Java presents multi-byte data as Big Endian (for example in DataOutputStream and, by default, ByteBuffer), while in C++ it depends on the implementation and CPU architecture. On x86 machines, it is Little Endian. I suspected it might affect us, as we deal with bitwise operations in both C++ and Java.

Luckily we didn't deal with more than 1 byte at a time. Each character in the String is mapped to a 1-byte value and stored in an array. Thus, endianness does not affect the output. A single byte has the same value whether it is read backward or forward. =)
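To see the two byte orders side by side, java.nio.ByteBuffer lets us pick the order explicitly. This is just a sketch to illustrate the idea; the helper names toBytes and hex are my own.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class; the names are illustrative only.
class EndianDemo {

    // Lay out a 4-byte int in the requested byte order.
    static byte[] toBytes(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    // Format bytes as two-digit hex values separated by '-'.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(toBytes(0x12345678, ByteOrder.BIG_ENDIAN)));    // prints 12-34-56-78
        System.out.println(hex(toBytes(0x12345678, ByteOrder.LITTLE_ENDIAN))); // prints 78-56-34-12
    }
}
```

With a 1-byte value there is nothing to reorder, which is why the byte-by-byte code is not affected.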

Zero and 0
Character '0' is not the same as the value 0. Character '0' is represented by value 48 in ASCII.

In the C++ program, 0x00 (0 in hex) is appended to the char[] array. In this case the value 0 is appended to the array, not the character '0', which would have appended the value 48.
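The difference is easy to check in Java. The sketch below appends a 0x00 byte to a byte[] in the same spirit as the C++ program; the class and method names are hypothetical, and how the C++ code actually builds its array is not shown here.

```java
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative only.
class ZeroDemo {

    // Return a copy of data with one extra 0x00 byte at the end.
    static byte[] appendZeroByte(byte[] data) {
        // Arrays.copyOf pads the new slot with zero, which is the value 0, not '0'.
        return Arrays.copyOf(data, data.length + 1);
    }

    public static void main(String[] args) {
        System.out.println((int) '0');  // prints 48
        System.out.println(0x00);       // prints 0
        byte[] withZero = appendZeroByte(new byte[] { 'A', 'B' });
        System.out.println(Arrays.toString(withZero)); // prints [65, 66, 0]
    }
}
```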

In conclusion
Documentation matters. Standard matters.

The XOR part of the code (I assume it is some kind of hashing) gave us a harder time than the AES encryption. AES produces the same output everywhere as long as the implementation conforms to the standard. In Java, it is just a matter of calling the method in the library.

That's what I understand so far. Feel free to correct me if there's any error.

Assumption:
the C++ code is compiled and run on a 32-bit machine.

Reference: