1. Preliminary
The union budget of India for the year 2019-2020 was presented by finance minister Nirmala Sitharaman on 5 July 2019. This was the National Democratic Alliance – II (“NDA-II”) government’s first budget. Shortly after Sitharaman’s presentation of the budget, IndiaSpend, a non-profit organization that uses open data to analyse India’s socio-economic issues, published an insightful piece comparing her 2019 speech with former finance minister Arun Jaitley’s speech from 2014, when the Modi government first came to power. IndiaSpend compared the number of times certain keywords (such as agriculture, tax, renewable energy, etc.) occurred in Sitharaman’s speech against the count of the same keywords in Jaitley’s 2014 speech. They also measured the amount of time Sitharaman and Jaitley spent on particular issues in their respective speeches. This comparison revealed a number of similarities between the priorities of the first and second Modi governments’ budget speeches in 2014 and 2019 respectively. The piece concluded that both speeches focused more on physical infrastructure, taxation, banking, and the financial sector and less on health, education, climate change, and labour.
Building on this method of analysing keywords, Ikigai Law has initiated a project, titled ‘Budget Speeches Analysis’, which compares the frequency of keywords used in the past five union budget speeches (from the years 2014 to 2019) and one interim budget speech (of February 2019) and attempts to identify word frequency[1] trends. Moreover, the project attempts to match trends in these keywords with the trends in indicators of specific sectors of the Indian economy over the last five years. In this blog post, we will detail the methodology used to generate word frequencies for each union budget and outline the application of this methodology for deriving data-driven analysis of the different budget speeches.
2. Methodology
This analysis uses the ‘collections’ and ‘operator’ libraries in Python 3.0 to derive word frequencies. A brief explanation of these terms is given below:
Python: Python is a high-level, object-oriented, open-source programming language. Python 3.0 is a version of the Python language released in 2008. Python’s features have led to its widespread adoption by developers of emerging technologies. For instance, many artificial intelligence and analytics tools are built on Python.
Libraries in computer programming: Simply put, a library (sometimes referred to as a module or package) is a collection of readymade tools that can be used together to accomplish a larger task. It resembles a mechanical toolbox whose tools and instruments can be used together to fix a car without having to manufacture any of them first. It is this collection of readymade ‘functions’ in a library that enables the user to perform an action without having to write the entire code.
Functions in computer programming: A function is an independent, organized, reusable chunk of code that is used to perform a single task. Functions resemble the individual tools of a mechanical toolbox, each of which performs one particular action. Usually, functions take a set of values as an input and give a set of values as an output. A function is denoted as “function_name(value1, value2, …)” where “function_name” is the name of the function and “value1” is one of the many values that the function can take as an input.
Collections library: The collections library provides specialized container datatypes, along with the functions needed to work with them. A datatype is the kind of value (numerical, textual, Boolean, etc.) that a variable can store. Variables are used in programming languages to temporarily store a certain value. A container datatype holds several values together, and those values may themselves have different datatypes. For example, Python’s built-in ‘dict’ is a container datatype that stores key-value pairs, where the ‘key’ can be a text and the ‘value’ a number, such as ‘Boy-10’. The collections library offers specialized variants of such containers.
Operator library: The operator library contains different functions that are generally used to perform tasks such as value comparisons (greater than, smaller than, equal to, etc.), logical operations, mathematical operations (addition, subtraction, etc.) and sequence operations.
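To make these terms concrete, the short sketch below shows a plain ‘dict’, a container from the collections library, and a few functions from the operator library. The variable names and sample values are assumptions chosen purely for illustration and are not taken from the project.

```python
from collections import Counter
from operator import add, gt, itemgetter

# A plain dict storing one key-value pair, as in the 'Boy-10' example.
ages = {"Boy": 10}

# Counter, from the collections library, is a specialized dict that tallies items.
tally = Counter(["tax", "tax", "health"])

# Functions from the operator library wrap common operations.
total = add(2, 3)                                # same as 2 + 3
is_greater = gt(tally["tax"], tally["health"])   # same as 2 > 1
second_item = itemgetter(1)(("tax", 2))          # picks the item at position 1

print(ages["Boy"], tally["tax"], total, is_greater, second_item)
```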
There are four major steps that are executed using these libraries in Python 3.0 to obtain word frequencies:
2.1 Removal of punctuation marks and conversion of the text to lower case: The first step of extracting words from a sentence or a large sample text is to remove all elements of a sentence except for words and the spaces between them. Punctuation marks (such as full stops, commas, question marks, etc.), if not removed before processing the text, will lead to words being counted and represented incorrectly. Python’s built-in ‘strip()’ function, when given the punctuation characters to look for, removes punctuation marks before and after every word in the sample text, making the text consistent throughout. Similarly, the ‘lower()’ function converts all upper-case letters to lower case so that the same word written as “Hello” and “hello” is not counted as two different words.
Incorrect representation: A word followed by a comma, such as “said” in “He said, hello…”, will be presented to the user as “said,” (with an attached comma) instead of “said”.
Incorrect counting: The same word appearing twice, once with and once without a punctuation mark, will be represented as two different words. In the phrases “He said, hello…” and “He said that …”, the word “said” will be represented as “said” with a wordcount of 1 and “said,” with a wordcount of 1, instead of “said” being shown with a wordcount of 2.
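A minimal sketch of this cleaning step is given below. The helper name ‘clean_word’ is hypothetical, and the sketch assumes that Python’s ‘string.punctuation’ constant (which lists ASCII punctuation only) is an acceptable set of characters to strip.

```python
import string


def clean_word(word):
    """Remove punctuation from both ends of a word and lower-case it."""
    # strip() only removes the listed characters from the start and end of
    # the word; punctuation inside the word (e.g. an apostrophe) is kept.
    return word.strip(string.punctuation).lower()


print(clean_word("said,"))     # said
print(clean_word("Hello"))     # hello
print(clean_word("hello..."))  # hello
```

Typographic characters such as curly quotes or the single-character ellipsis are not part of ‘string.punctuation’, so a text containing them would need those characters added to the set being stripped.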
2.2 Creation of a word list: The idea here is to separate out each word from the whole sample text such that all different words can be counted one by one. This is achieved by making a list of all the keywords occurring in the text. For example, if the sample text contains the phrase “hello world”, and both words are keywords, the word list of this line would consist of two words, “hello” and “world”, such that each word can be separately identified and stored by the computer. The ‘split()’ function in Python is used to create this word list; it takes the complete sample text as an input and produces the word list as an output. The ‘split()’ function identifies the spaces between two words to demarcate the ending of a preceding word and the beginning of a succeeding word. This process is followed throughout the sample text to identify different words in the order of their occurrence.
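Separating a text into words at the spaces between them can be sketched with Python’s built-in ‘split()’ string method. The sample text below reuses the “hello world” example; the variable names are assumptions made for illustration.

```python
sample_text = "hello world"

# With no arguments, split() breaks the text wherever there is whitespace
# (spaces, tabs or line breaks) and returns the pieces in order.
word_list = sample_text.split()

print(word_list)  # ['hello', 'world']
```

Because ‘split()’ without arguments treats any run of whitespace as a single separator, repeated spaces or line breaks in a speech transcript do not produce empty entries in the word list.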
2.3 Count frequency and sort the words in descending order: The ‘Counter’ class from the collections library (written with a capital ‘C’) is used to determine the frequency of different words in the created word list. It performs a simple and quick tally of all the keywords within the word list. Conceptually, ‘Counter’ goes through the word list one word at a time and, each time it encounters a word, increases that word’s frequency by one. ‘Counter’ stores each key-value (or word-frequency) pair in the form of a dictionary or ‘dict’. The ‘itemgetter()’ function from the operator library and Python’s built-in ‘sorted()’ function are then used to sort the word-frequency pairs in descending order of frequency.
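The counting and sorting step can be sketched as follows. The word list is an invented example rather than data from any budget speech, and the variable names are assumptions.

```python
from collections import Counter
from operator import itemgetter

# Invented word list standing in for the output of the previous step.
word_list = ["tax", "health", "tax", "growth", "tax", "growth"]

# Counter tallies how many times each word occurs.
frequencies = Counter(word_list)

# Each item is a (word, frequency) pair; itemgetter(1) selects the frequency,
# and reverse=True sorts from the highest frequency to the lowest.
sorted_frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)

print(sorted_frequencies)  # [('tax', 3), ('growth', 2), ('health', 1)]
```

‘Counter’ also offers a ‘most_common()’ method that returns the pairs from most to least frequent, which could serve as a shorter alternative to the explicit sort.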
2.4 Merge words with same base words: Words having the same base word, such as “tech” for “technology” and “technological”, are combined manually to derive the final frequency.
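The project carries out this merging by hand. One possible way of recording such manual decisions in code is a hand-written mapping from each variant to its base word, as in the sketch below. The mapping, the counts and the variable names are all hypothetical and only illustrate the idea.

```python
from collections import Counter

# Invented frequencies standing in for the output of the previous step.
frequencies = Counter({"technology": 4, "technological": 2, "tax": 5})

# Hypothetical hand-written mapping from each variant to its base word.
base_words = {"technology": "tech", "technological": "tech"}

merged = Counter()
for word, count in frequencies.items():
    # Words without an entry in the mapping are kept under their own name.
    merged[base_words.get(word, word)] += count

print(merged["tech"], merged["tax"])  # 6 5
```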
3. Application of methodology
We have used the above method to derive the word frequencies for the past five union budget speeches and one interim budget speech. These include Nirmala Sitharaman’s maiden budget speech in 2019, one interim budget speech by former finance minister Piyush Goyal in 2019 and four speeches by former finance minister Arun Jaitley from 2014 to 2018. The purpose of this project is to apply simple analytical tools to detect patterns and derive fresh insights from these budget speeches, which would otherwise not be explicitly visible to a reader. For instance, an increase in the usage of certain terms, such as technology, ‘Digital India’, etc., in the budget speeches over a period of time could indicate that digitalisation and promotion of technology have been matters of rising importance for the government during that period. The project also attempts to determine whether the importance given to certain sectors in the budget speeches translates into a tangible difference in the growth or promotion of those sectors. The project hopes to answer questions such as “Has the increase in the use of the phrase ‘investment promotion’ in the budget speeches over five years led to more investments in India over the same period of time?” and so on. Such an analysis provides an objective manner of assessing the central government’s commitment to its budget proposals in the long term, allowing for a better appreciation of its role in enabling economic growth.
(Authored by Vihang Jumle, Associate, Ikigai Law with inputs from Tuhina Joshi, Policy Associate; and Anirudh Rastogi, Founder at Ikigai Law.)
[1] For the purposes of this project, the term ‘word frequency’ has been used to denote the number of times a word occurs in a particular text.