This paper covers the
history and use of literals (or constants) in programming languages, from the
beginning of programming to the present day. Literals in many programming
languages are discussed including modern languages such as C, Java, scripting
languages, and older languages such as
Design Issues for Integer Constants
Visual Basic .NET Type Designations
Design Issues for Floating Point Constants
Design Issues for Character Strings
Perl and UNIX Shell Character Strings
Perl Additional Escape Sequences
String Comparison==move to strings
New FORTRAN Declarations why here, move to type chapter??
Copyright Dennie Van Tassel 2004.
Please send suggestions and comments to dvantassel@gavilan.edu
Literals or constants are values written in a conventional form whose value is obvious. In contrast to variables, literals (123, 4.3, hi) do not change in value. These are also called explicit constants or manifest constants. I have also seen these called pure constants, but I am not sure if that terminology is agreed on. At first glance one may think that all programming languages write their literals the same way. While different languages share many conventions, there are some interesting differences.
| Literal | Explanation |
| 285 | Typical integer |
| 34.67 | Typical real |
| 4.23E-4 | Typical scientific |
| 140_345 | Integer in Perl or |
| true | Typical boolean |
| 0x1b or Z"1B" | Hexadecimal literal |
| 'B' | Typical character |
| "Hello" or 'Hello' | Typical character string |
| 5HHello | Old FORTRAN Hollerith string |
| null ZERO | Special literals |
Various Literals in Different Languages
Table x.1
The literals of a language represent the values available in its primitive types. The usual kinds of literals are integers, floating point numbers, Booleans, and character strings. Each of these will be discussed in this chapter.
Integers are commonly described as numbers without a decimal point or exponent. Another description for integer literals is a string of decimal digits without a decimal point. Thus the following are valid integers in all languages:
123 0 -14 21345
Integers may or may not have a sign and must fall within some restricted range. Negative values must be preceded by a minus sign. If integers use 32 bits, then the maximum value is 2^31 - 1 (since one bit is needed for the sign).
There are two more integer constants available in some languages:
+45
5e2
Early C did not allow +45: integers without a sign, such as 45, are positive by default, so no unary positive sign was provided. Thus C had a unary negative operator but no unary positive operator. Later C compilers and Java do allow the unneeded positive sign on constants. Few other languages actually forbid unary positive signs.
The last constant, 5e2, which evaluates to 500, would be a floating point value in C and FORTRAN. Their rule is that a floating point constant has a decimal point or an exponent, or both. Thus 5.0, 5e0, and 5.0e0 would all be the same floating point 5.0. But in
There are a few design issues for integers. They are:
Each of the above questions is answered yes in some language, and different languages answer them differently.
Most languages have one or more default sizes for integers. On a machine with a 16-bit word, integers range from -32,768 to +32,767; the upper limit is 2^15 - 1. On a machine with a 32-bit word, integers range from -2,147,483,648 to +2,147,483,647; the upper limit is 2^31 - 1. Today 64-bit integers are common. Unfortunately, computer integers cannot have those useful commas to mark thousands.
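The limits quoted above follow directly from the word size. Here is a minimal sketch in Python, assuming two's-complement storage (the usual arrangement, though not something every language guarantees); the function name signed_range is only for illustration.

```python
def signed_range(bits):
    """Return (smallest, largest) for a two's-complement integer of `bits` bits."""
    # One bit is reserved for the sign, leaving bits - 1 bits for the magnitude.
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for bits in (16, 32, 64):
    low, high = signed_range(bits)
    # The ',' format adds the thousands separators that a literal cannot carry.
    print(f"{bits}-bit: {low:,} to {high:,}")
```

The commas appear only in the printed output; the literals themselves still have to be written without them.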
But this is an oversimplification, since hexadecimal integers use letters. We may also want octal values and some way to indicate the desired size of our integers. Also, the definition of integer given above is not true for all languages.
For example, in Ada both integer and real literals can have an exponent. Thus in Ada the following are all integer literals:
21e2 210e+1 2100e+0
But in many other languages the exponent would indicate that the above are floating point literals. For integers, the exponent must not be negative. Ada allows us to use the underscore to improve readability. The underscore is often used to separate a number into groups of three digits, much as commas are used outside programming. Here are some examples:
1_234.56 408_847_1400 1_000_000 12_27_05 4_345e2
In most of the above numbers the underscore is placed where a comma would normally be, but the underscore can be placed in any convenient place. Perl and Ruby also allow underscores in their integers.
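To make the Ada rules concrete, here is a rough Python sketch that works out the value of an Ada-style decimal integer literal, with underscores between digits and an optional non-negative exponent. The function name and the regular expression are my own illustration, not part of any Ada tool, and the sketch ignores based literals and reals.

```python
import re

# Hypothetical pattern: digits with single underscores between them,
# then an optional exponent that may carry a plus sign but not a minus sign.
_ADA_INTEGER = re.compile(r"(\d+(?:_\d+)*)(?:[eE]\+?(\d+(?:_\d+)*))?")


def ada_integer_value(text):
    """Return the value of an Ada-style decimal integer literal such as 4_345e2."""
    match = _ADA_INTEGER.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    digits, exponent = match.groups()
    value = int(digits.replace("_", ""))        # underscores carry no meaning
    if exponent is not None:
        value *= 10 ** int(exponent.replace("_", ""))
    return value


print(ada_integer_value("1_000_000"))
print(ada_integer_value("21e2") == ada_integer_value("2100e+0"))
```

A literal with a decimal point or a negative exponent, such as 1_234.56 or 21e-2, is rejected here because it is not an integer literal.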
If we have more than one size of integer, we need some way to indicate the precision of an integer constant. The C family uses an L or l (ell) after an integer to indicate a long integer. Thus 12L is a long integer. We can use the lower case l, but few can tell the difference between 12l (12 and ell) and 121 (12 and one), so we always use an upper case L. These suffixes are useful to force arithmetic into a particular precision.
Besides long integers, C has unsigned integers, which use the suffix u or U. Thus we could write 15u or 15U to get the unsigned integer fifteen. Unsigned long integers are indicated with a terminating ul or UL, so 23ul or 23UL gives the unsigned long integer twenty-three. For regular integers one bit must be reserved for the sign. If a variable or constant is unsigned, that bit can be used for the value instead. Thus a 16-bit signed integer ranges from -32,768 to +32,767 (the upper limit is 2^15 - 1), but an unsigned integer in the same amount of storage ranges from 0 to +65,535, which is 2^16 - 1.
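The effect of giving up the sign bit can be seen by reading the same 16 bits both ways. The following Python sketch assumes two's-complement storage; the helper names as_unsigned and as_signed are invented for this illustration.

```python
def as_unsigned(value, bits=16):
    """Read the low `bits` bits of `value` as an unsigned integer."""
    return value & ((1 << bits) - 1)


def as_signed(value, bits=16):
    """Read the low `bits` bits of `value` as a two's-complement signed integer."""
    unsigned = as_unsigned(value, bits)
    sign_bit = 1 << (bits - 1)
    # If the top bit is set, the pattern stands for a negative number.
    return unsigned - (1 << bits) if unsigned & sign_bit else unsigned


pattern = 0xFFFF                 # all sixteen bits set
print(as_unsigned(pattern))      # largest unsigned 16-bit value
print(as_signed(pattern))        # the same bits read as a signed value
```

The bit pattern is identical in both calls; only the interpretation of the top bit changes.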
If we are in a language that has long integers, then how do we use them? For example, if we write 123456789012, we do not want to end up with an integer overflow or truncation. A good compiler would automatically store this integer as a long integer, but we may want to help it (or us) with 123456789012L.
In most languages long integers are restricted to some large size. Python uses the same L to indicate a long integer, as in 12345678901234567890L, but Python long integers can be arbitrarily big. Other languages, such as Ruby and the Lisp dialects, also have these arbitrarily long integers; such implementations are called bignum systems.
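A short Python sketch shows what arbitrarily big means in practice. Note one assumption about versions: Python 3 no longer accepts the L suffix, since every integer there is already a long integer, so the literal is written without it.

```python
big = 12345678901234567890       # written without a suffix in Python 3
largest_64_bit = 2 ** 63 - 1     # upper limit of a signed 64-bit integer

print(big > largest_64_bit)      # the literal does not fit in 64 signed bits
print(big * big)                 # the product is still computed exactly
```

No overflow or truncation occurs; the integer simply takes as much storage as it needs.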
These forms of BASIC have two types of integers. The two types are integer and long integer. Early BASIC did not have types for numbers. There was no distinction between integers and floating point. But now we have several numeric types. For numeric constants a suffix is used on the number to indicate the type. Here is what they use:
| Numeric Type | Suffix | Bytes of Storage |
| Integer | % | 2 |
| Long integer | & | 4 |
| Single precision | none or ! | 4 |
| Double precision | # | 8 |
Types in BASIC
Table x.2
Thus 15% is an integer, 15& is a long integer, and 15 (or 15!) is a single precision floating point number. By default all numbers are real (floating point) single precision. If we want a double precision 15, then we type 15#.
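The suffix rule amounts to a small lookup table. Here is a Python sketch of it, taking the suffixes and storage sizes from Table x.2; the function name is hypothetical and no real BASIC interpreter is being modeled.

```python
# Suffix -> (type name, bytes of storage), as listed in Table x.2.
BASIC_SUFFIXES = {
    "%": ("Integer", 2),
    "&": ("Long integer", 4),
    "!": ("Single precision", 4),
    "#": ("Double precision", 8),
}


def basic_numeric_type(literal):
    """Return (type name, bytes of storage) for a BASIC numeric constant."""
    # A constant with no suffix gets the default type, single precision.
    return BASIC_SUFFIXES.get(literal[-1:], ("Single precision", 4))


for constant in ("15%", "15&", "15", "15!", "15#"):
    print(constant, basic_numeric_type(constant))
```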
VB .NET has broken from its BASIC parents and changed the type-designation characters appended to numeric literals. Whole numbers (no decimal point) are type Integer and numbers with a decimal point are type Double. Otherwise, it uses a method similar to previous dialects of BASIC, but with different codes to change the default type. The VB .NET codes are as follows:
S Short integer
I Integer
L Long integer
F Single-precision floating point
R Double-precision floating point
D Decimal
So there are three types of integers and two types of floating point. Decimal is used for decimal fractions such as dollars and cents. Thus 45S is a Short integer, 45I (or 45) is an Integer, and 45L is a Long integer. And 234.5F is a Single-precision floating point literal and 234.5R (or 234.5) is a Double-precision floating point literal. Finally, 780.23D is a Decimal currency-type literal.
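The VB .NET rule differs from the older BASIC rule in that the default depends on whether a decimal point is present. Here is a Python sketch of the rule as described above; the function name is hypothetical, and the sketch handles only plain decimal literals with the upper case codes listed.

```python
# Type-designation codes as listed above.
VB_NET_SUFFIXES = {
    "S": "Short", "I": "Integer", "L": "Long",
    "F": "Single", "R": "Double", "D": "Decimal",
}


def vb_net_literal_type(literal):
    """Return the type name of a plain decimal VB .NET numeric literal."""
    suffix = literal[-1:]
    if suffix in VB_NET_SUFFIXES:
        return VB_NET_SUFFIXES[suffix]
    # No code: a decimal point means Double, otherwise Integer.
    return "Double" if "." in literal else "Integer"


for constant in ("45S", "45", "45L", "234.5F", "234.5", "780.23D"):
    print(constant, vb_net_literal_type(constant))
```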
The range of values in VB .NET is much larger than in previous languages. For example, Long integers range over about ±9 x 10^18. C# .NET has similar types and value ranges.
Sometimes we want a different base or radix of our constants besides base 10. Base 8 and base 16 are useful for storage addresses. The C family allows us to indicate octal constants by preceding the number with a zero. So 012 is octal 12, not decimal twelve. For octal values the range of digits is 0-7.
So putting this together with what we learned in the previous section we can use the terminating L to make the constant Long and the U to make it unsigned. Thus 012UL is the unsigned long octal value 12 or the equivalent of the decimal value 10.
For hexadecimal values we need to precede the number with 0x or 0X. Thus 0x12 is hexadecimal 12, not decimal 12. Now the range of acceptable digits is 0 1 2 3 ... 9 A B C D E F. The letters may be upper case (A-F) or lower case (a-f). Again we can use the long integer indicator L on these too. Thus 07L is a long octal seven, and 0x7L is a long hexadecimal seven. We can also use the terminating U to make it unsigned. Thus 0XFUL is the unsigned long hexadecimal value F, which is equivalent to the decimal value 15.
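The prefix and suffix rules for C-style integer constants can be put together in a few lines. The following Python sketch is only an illustration of the rules in the text, not a model of any real compiler; the function name is invented, and the sketch is looser than C in that it does not reject repeated suffixes.

```python
def parse_c_integer(literal):
    """Return (value, is_unsigned, is_long) for a C-style integer constant."""
    body = literal.upper()
    is_unsigned = is_long = False
    # Peel the U and L suffixes off the end, in either order.
    while body and body[-1] in "UL":
        if body[-1] == "U":
            is_unsigned = True
        else:
            is_long = True
        body = body[:-1]
    if body.startswith("0X"):
        value = int(body[2:], 16)     # 0x or 0X prefix: hexadecimal
    elif body.startswith("0") and len(body) > 1:
        value = int(body, 8)          # leading zero: octal
    else:
        value = int(body, 10)         # otherwise decimal
    return value, is_unsigned, is_long


for constant in ("012", "012UL", "0x12", "0XFUL", "12L"):
    print(constant, parse_c_integer(constant))
```

A constant such as 08 raises a ValueError here, because 8 is not an octal digit.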
Ruby does the same for octal and hexadecimal literals as C does, but Ruby has added 0b for binary numbers. So in Ruby we can have hexadecimal values like 0x12, octal values like 012, and binary values like 0b1001.
FORTRAN 90 does this a little differently. It allows radix (number base) 2, 8, or 16. The value starts with the letter B for binary or radix 2, the letter O (oh) for octal, and the letter Z for hexadecimal. The letter is followed by a string of digits enclosed in double or single quotes. The digits must be acceptable for the desired base (no 8 or 9 in octals). The integer value 200 would be B'11001000' for base two, O'310' for base eight, and Z'C8' for base 16. I try very hard not to be chauvinistic, but I sure like the C method better in this case.
This FORTRAN 90 solution illustrates the problem of adding a feature to an existing language. They cannot just decide to use the C solution, that all numbers starting with a zero are octal values. Millions of old FORTRAN programs would no longer work correctly when compiled on new FORTRAN 90 compilers, since 012 would be octal 12 instead of decimal 12. On the positive side of this change, thousands of old FORTRAN programmers would suddenly have employment.
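For comparison with the C notation, here is a Python sketch that evaluates FORTRAN 90 style constants written as a radix letter followed by quoted digits. The names FORTRAN_RADIX and parse_boz are my own, and the sketch covers only the form described in the text.

```python
# Radix letter -> number base, as described for FORTRAN 90.
FORTRAN_RADIX = {"B": 2, "O": 8, "Z": 16}


def parse_boz(constant):
    """Return the value of a constant such as Z'C8' or B"11001000"."""
    if len(constant) < 4:
        raise ValueError(f"too short: {constant!r}")
    letter, quoted = constant[0].upper(), constant[1:]
    if letter not in FORTRAN_RADIX:
        raise ValueError(f"unknown radix letter: {constant!r}")
    # The digits must sit between a matching pair of single or double quotes.
    if quoted[0] not in "'\"" or quoted[-1] != quoted[0]:
        raise ValueError(f"digits must be quoted: {constant!r}")
    return int(quoted[1:-1], FORTRAN_RADIX[letter])


for constant in ("B'11001000'", "O'310'", 'Z"C8"'):
    print(constant, parse_boz(constant))
```

All three constants denote the same integer, 200. Because the radix letter and quotes mark the constant explicitly, an ordinary 012 keeps its decimal meaning.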
2#100011# 4#203#
8#43# 10#35# 16#23#
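The notation above, base#digits#, appears to be the form Ada uses for based literals, where the base may be anything from 2 to 16. Under that assumption, a Python sketch (with an invented function name) confirms that all five literals denote the same value.

```python
def parse_based_literal(literal):
    """Return the value of a based integer literal written as base#digits#."""
    # Splitting "16#23#" on '#' gives ["16", "23", ""].
    base_text, digits, tail = literal.split("#")
    if tail != "":
        raise ValueError(f"unexpected text after closing #: {literal!r}")
    base = int(base_text)
    if not 2 <= base <= 16:
        raise ValueError(f"base must be between 2 and 16: {literal!r}")
    return int(digits.replace("_", ""), base)


for literal in ("2#100011#", "4#203#", "8#43#", "10#35#", "16#23#"):
    print(literal, parse_based_literal(literal))
```

Each literal evaluates to decimal 35. The same value in base 7 would be written 7#50#.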
While this is kind of interesting, I do not see much use for
base 7 or 11, but obviously someone did. In addition, C and FORTRAN 90 can only
use octal or hexadecimal integer constants;
1. Suppose you wanted to add more
bases to Java or C++. Presently, those languages can only handle decimal,
octal, and hexadecimal. The
Reals are numbers with a decimal point, thus 4.3 is a real literal. Real numbers are called floats or floating point in some languages. Another description of reals is a number with a decimal point or an exponent (or both), thus 2e2 would be a real literal using this definition. Like integer literals, a positive or negative sign can precede the number and no commas are allowed. Thus some real literals are:
0.0 -4.302 7. 3.2e-4 4.9678E+3 4e-3
If the language accepts both lower and upper case, the e for exponent can be lower case or upper case. Whether 4e-3 is acceptable varies by language; some may require 4.0e-3 (with a decimal point). The e stands for exponent and means multiply by 10 raised to the power that follows. Thus
4.3e2 = 4.3 x 10^2 = 4.3 x 100 = 430.0
Scientific notation is useful for expressing very small numbers or very large values (such as your chances to win the lottery or the national debt).
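The decimal point or exponent rule for telling reals from integers can be sketched in Python as follows. The pattern and function name are my own illustration and accept a slightly generous shape (for instance 7. and 4e-3), which not every language allows.

```python
import re

# Optional sign, digits with an optional decimal point, optional exponent.
_NUMBER_SHAPE = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")


def classify_number(text):
    """Classify a numeric literal as 'integer' or 'real'."""
    if _NUMBER_SHAPE.fullmatch(text) is None:
        raise ValueError(f"not a numeric literal: {text!r}")
    # A decimal point or an exponent (or both) makes the literal real.
    return "real" if any(ch in text for ch in ".eE") else "integer"


for literal in ("123", "-14", "7.", "3.2e-4", "4e-3"):
    print(literal, classify_number(literal))

print(float("4.3e2"))    # 4.3 x 10^2
```

The last line prints 430.0, matching the worked example above.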
There are a few design issues for floating point constants. Here are some: