A string is a sequence of characters, and any character can be accessed by its index. Indexing starts at 0 from the left, or at -1 when counting from the end. The built-in function len returns the number of characters in a string. Unlike indexing, len is not zero based: it returns the actual count, so the last valid index is always one less than the length.

In [1]:

# Creating a string
text = 'Some collection of words'

# Assigning the number of characters in the given string to a variable
total_char = len(text)

# Printing the total number of characters
print('Number of characters = {}'.format(total_char))

Number of characters = 24
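A minimal sketch of how the first and last indices relate to len. The variable names first_char and last_char are only for illustration and are not part of the original notebook.

```python
# Same string as in the cell above
text = 'Some collection of words'

# The first character sits at index 0, the last one at index len(text) - 1
first_char = text[0]
last_char = text[len(text) - 1]

# Negative indices count from the end, so -1 also points at the last character
same_last_char = text[-1]

print(first_char, last_char, same_last_char)

# An index equal to len(text) is already out of range
try:
    text[len(text)]
except IndexError as error:
    print('IndexError:', error)
```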

Let’s look at the indexing of a string. As already mentioned, indexing is zero based, so the indices run from 0 to 23 while the string contains 24 characters. The cell below demonstrates this: it loops over the characters and prints the corresponding index under each one.

{0:3} pads each printed character to a width of 3. The print() function adds a new line after each call; passing end="" lets us continue printing on the same line.

In [2]:

# Looping through the characters in the given string
for letter in text:
    print('{0:3}'.format(letter), end="")

# Switching to the next line
print()

# Integers are right-aligned by default. We can use '<' to align to the left
for i in range(total_char):
    print('{0:<3d}'.format(i), end="")

S  o  m  e     c  o  l  l  e  c  t  i  o  n     o  f     w  o  r  d  s  
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
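To make the alignment rules a bit more concrete, here is a small sketch of the alignment flags in the format specification. The trailing '|' is added only to make the padding visible, and the variable names are made up for this example.

```python
# '<' aligns left, '>' aligns right, '^' centers within the given width
left = '{0:<3d}|'.format(7)
right = '{0:>3d}|'.format(7)
center = '{0:^3d}|'.format(7)

# Without an explicit flag, integers go to the right and strings to the left
default_int = '{0:3d}|'.format(7)
default_str = '{0:3}|'.format('a')

for sample in (left, right, center, default_int, default_str):
    print(repr(sample))
```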

The same output can be produced with the older formatting style, using the % operator.

In [3]:

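# Looping through the characters; '-' left-aligns within a width of 3
for letter in text:
    print('%-3s' % letter, end="")

# Switching to the next line
print()

# Printing the indices under the characters
for i in range(len(text)):
    print('%-3d' % i, end="")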

S  o  m  e     c  o  l  l  e  c  t  i  o  n     o  f     w  o  r  d  s  
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
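As a quick check that both styles really build the same row, the sketch below joins the padded characters into one string instead of printing them one by one. The f-string variant is an addition that the original cells do not use; it assumes Python 3.6 or newer.

```python
text = 'Some collection of words'

# str.format, as in cell 2
with_format = ''.join('{0:3}'.format(letter) for letter in text)

# % operator, as in cell 3
with_percent = ''.join('%-3s' % letter for letter in text)

# f-string (assumes Python 3.6+), using the same format spec as str.format
with_fstring = ''.join(f'{letter:3}' for letter in text)

print(with_format == with_percent == with_fstring)
```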

Strings can be sliced. A slice runs from the start index up to, but not including, the end index. A string can also be reversed by slicing with a step of -1:

In [4]:

text[0:10]

Out[4]:

'Some colle'

In [5]:

text[::-1]

Out[5]:

'sdrow fo noitcelloc emoS'
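A few more slices on the same string, as a rough sketch of the general form text[start:stop:step]. The variable names are chosen for this example only.

```python
text = 'Some collection of words'

# Omitting the start means "from the beginning"
first_word = text[:4]

# Start and stop together select a middle part; index 15 is excluded
second_word = text[5:15]

# A negative start counts from the end; omitting the stop means "to the end"
last_word = text[-5:]

# Unlike indexing, slicing past the end does not raise an error
tail = text[20:100]

print(first_word, second_word, last_word, tail)
```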

Example

Let’s look at some string methods in practice. The Kaggle Titanic challenge provides a data set that is split into a training set and a test set. One of the columns contains the names of the passengers. At first glance the column may look unusable, but with feature engineering we can extract additional data from it. Looking closely, the names follow the format Last Name, Title. First Name (and sometimes a middle name), e.g. Braund, Mr. Owen Harris. The data is stored as strings, so we can extract the title from each name and categorize it. One possible solution is shown below:

In [6]:

# Importing pandas library
import pandas as pd

# Loading the given data set to a pandas DataFrame
df = pd.read_csv('data/train.csv')
df.head()

Out[6]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Each name starts with a last name, followed by a comma and a blank space. The title is followed by a period. Using the find method, we can locate the characters we are looking for, in this case the comma (,) and the period (.). Let’s look at a single example:

In [7]:

# Creating a string
name = 'Braund, Mr. Owen Harris'

# Detecting the indices of the required characters
name.find(','), name.find('.')

Out[7]:

(6, 10)

The find method shows that the comma is at index 6 and the period is at index 10 in the given string. The title (in this case Mr) therefore occupies indices 8 and 9, and can be sliced with name[8:10].

In [8]:

# Slicing the string
name[name.find(',') + 2 : name.find('.')]

Out[8]:

'Mr'
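One thing to keep in mind is that find returns -1 when the character is missing, and the slice then silently returns the wrong text instead of failing. The helper below is a sketch of one way to guard against that; extract_title is a hypothetical name, not something from the original notebook, and it uses strip instead of the fixed offset of 2.

```python
def extract_title(name):
    """Return the text between the first comma and the next period, or None."""
    comma = name.find(',')
    # Look for the period only after the comma
    period = name.find('.', comma + 1)
    # find returns -1 when the character is not present
    if comma == -1 or period == -1:
        return None
    return name[comma + 1:period].strip()


print(extract_title('Braund, Mr. Owen Harris'))

# Without the guard, both find calls return -1 and the slice becomes name[1:-1]
bad_name = 'No title here'
print(bad_name[bad_name.find(',') + 2 : bad_name.find('.')])
print(extract_title(bad_name))
```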

Once the general pattern is known, a for loop can slice each name in the column to extract the title, which is then saved to the DataFrame.

In [9]:

# Creating an empty list to store the titles
titles = []

# Looping through the names in the DataFrame
for name in df['Name']:
    # Extracting the title by slicing the string
    title = name[name.find(',') + 2 : name.find('.')]
    # Appending the title to the list
    titles.append(title)

# Assigning the list to the DataFrame as a new column 'Title'
df['Title'] = titles

# Checking the first 5 titles
df['Title'][:5]

Out[9]:

0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Title, dtype: object
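To sketch the categorizing step without needing the CSV file, the example below applies the same slice to the four untruncated names from the table above and counts the titles with collections.Counter. Using a list comprehension and Counter is a choice made for this illustration, not the method used in the notebook.

```python
from collections import Counter

# Names copied from the first rows of the data set shown above
names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Allen, Mr. William Henry',
]

# Same slice as in the loop, written as a list comprehension
titles = [name[name.find(',') + 2 : name.find('.')] for name in names]

# Counting how often each title occurs
title_counts = Counter(titles)

print(titles)
print(title_counts)
```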

The same result can be achieved in another way, using the split and strip methods.

In [10]:

# Creating a string
name = 'Braund, Mr. Owen Harris'

# Applying split and strip methods
name.split(',')[1].split('.')[0].strip()

Out[10]:

'Mr'

Let’s go through this step by step:

1. The name string is split at the comma into a list of two strings.
2. The element at index 1 (the second element) is selected and split again, this time at the period, into another list of two strings.
3. Finally, the strip method removes the leading blank space from the element at index 0.

Each step is demonstrated in the code below:

In [11]:

print('Step 1 - {}'.format(name.split(',')))
print('Step 2 - {}'.format(name.split(',')[1].split('.')))
print('Step 3 - {}'.format(name.split(',')[1].split('.')[0].strip()))

Step 1 - ['Braund', ' Mr. Owen Harris']
Step 2 - [' Mr', ' Owen Harris']
Step 3 - Mr
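The split at the period gives exactly two strings only when the name contains a single period. The sketch below uses a made-up name with a second period to show what changes, and how the optional maxsplit argument of split keeps the result at two parts. The name is hypothetical and does not come from the data set.

```python
# Hypothetical name with a second period at the end
name = 'Doe, Dr. John A.'

# A plain split cuts at every period, so the list has three elements
all_parts = name.split(',')[1].split('.')

# maxsplit=1 stops after the first period and leaves the rest untouched
two_parts = name.split(',', 1)[1].split('.', 1)

# The title is the element at index 0 in both cases
title = two_parts[0].strip()

print(all_parts)
print(two_parts)
print(title)
```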

A lambda expression and the pandas Series.map() method can replace the for loop and make the code cleaner.

In [12]:

# Adding a new column to the DataFrame by mapping a lambda expression over the names
df['NewTitle'] = df['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())

# Checking the first 5 rows of the DataFrame
df.head()

Out[12]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title	NewTitle
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	Mr	Mr
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	PC 17599	71.2833	C85	C	Mrs	Mrs
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	Miss	Miss
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	Mrs	Mrs
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	Mr	Mr

Downloadable content:
train.csv

codeWithMax
