1. Preliminary
1.1 ‘Budget Speeches Analysis’ is a project initiated by Ikigai Law to (i) identify trends in the word frequency[1] of different keywords in India’s budget speeches, and (ii) compare them against trends in different economic parameters particular to India. This project applies simple analytical tools to detect patterns and derive fresh insights from budget speeches, which would otherwise not be explicitly visible to a reader. It also attempts to determine whether the importance given to sectors in the budget speeches translates into a tangible difference in the growth or promotion of such sectors on the ground. In this post, we identify trends in the word frequency of the terms ‘data’, ‘digital’ and ‘internet’ and compare them against different metrics that track the growth of digital technologies in India. This is the fourth and final post in the ‘Budget Speeches Analysis’ series. The third post can be found here.
2. Scope of analysis
2.1 This post analyses the word frequency of ‘data’, ‘digital’ and ‘internet’ in India’s union budget speeches. It maps the sum of the word frequencies of these words against the actual number of Indians who use the internet, who have fixed broadband subscriptions and who have mobile cellular subscriptions. The sum of the word frequencies of ‘data’, ‘digital’ and ‘internet’ is referred to as ‘Words’ in this post. The interim budgets rarely used these words and hence have been dropped from this analysis. The time period for analysing the use of digital technologies runs from the earliest year for which reliable data was found up to 2017. All values used in this post are measured on an annual basis. All union budget speeches of India can be found here. All data used for this project is sourced from the World Bank’s Open Data project and can be found here, here, here, and here.
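The ‘Words’ count described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code: the function name `words_count`, the sample sentence, and the choice to match whole words case-insensitively (so that ‘database’ is not counted as ‘data’) are all assumptions, since the post does not describe how matching was done.

```python
import re
from collections import Counter

# Keywords whose frequencies are summed into the 'Words' value.
KEYWORDS = ("data", "digital", "internet")

def words_count(speech_text, keywords=KEYWORDS):
    """Return the summed frequency of the keywords in one speech.

    Assumption: matching is case-insensitive and on whole words only.
    """
    tokens = re.findall(r"[a-z]+", speech_text.lower())
    counts = Counter(tokens)
    return sum(counts[k] for k in keywords)

# Hypothetical sentence, not taken from any budget speech.
sample = "Digital India will expand internet access. Data, data and more data."
print(words_count(sample))  # prints 5
```

Running the function once per budget speech would give one ‘Words’ value per year, which is the series compared against the World Bank metrics.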
2.2 This post also builds a regression model[2] to determine the mathematical relationship between the number of individuals that access the internet in India and the number of mobile cellular subscriptions in India.
3. Data source
3.1 For the analysis to be fruitful, the project requires data that is available on an annual basis, aggregated across all Indian sectors, and covers a sufficiently long period of time. Although India’s national data sources meet the first two criteria, they do not provide this data for a long enough time period. The project hence relies on data provided by the World Bank’s Open Data project, managed by the Development Data Group of the World Bank. The World Bank works closely with the international community, such as teams at the United Nations, the International Monetary Fund, regional banks, etc., to source its data.
4. Metrics used in this post
4.1 ‘Individuals using the internet’:
The World Bank’s Open Data project defines this metric as “[i]ndividuals who have used the Internet in the last 3 months. The Internet can be used via a computer, mobile phone, personal digital assistant, games machine, digital TV etc.” This metric has been used in the project because it tracks the number of people in India using the internet.
4.2 ‘Fixed broadband subscription’:
According to the World Bank’s Open Data project, the metric fixed broadband subscription refers to “fixed subscriptions to high-speed access to the public Internet (a TCP/IP connection), at downstream speeds equal to, or greater than, 256 kbit/s. This includes cable modem, DSL, fiber-to-the-home/building, other fixed (wired)-broadband subscriptions, satellite broadband and terrestrial fixed wireless broadband. This total is measured irrespective of the method of payment. It excludes subscriptions that have access to data communications (including the Internet) via mobile-cellular networks. It should include fixed WiMAX and any other fixed wireless technologies. It includes both residential subscriptions and subscriptions for organizations.” This metric has been used in the project because it tracks the use of broadband in India.
4.3 ‘Mobile cellular subscriptions’:
According to the World Bank’s Open Data project, mobile cellular subscriptions refer to subscriptions to “a public mobile telephone service that provide access to the PSTN using cellular technology. The indicator includes (and is split into) the number of post-paid subscriptions, and the number of active prepaid accounts (i.e. that have been used during the last three months). The indicator applies to all mobile cellular subscriptions that offer voice communications. It excludes subscriptions via data cards or USB modems, subscriptions to public mobile data services, private trunked mobile radio, telepoint, radio paging and telemetry services.” This metric has been used in the project because it tracks mobile SIM subscriptions in India.
5. Results
The following graphs compare the trend in the word frequency of the Words and the trend in all three metrics used in this project.
5.1 The trend in the word frequency of the Words is compared to the trend in all three indicators for each year over a period of five years (2013 – 2017) in the following graph. It is observed that the growth in fixed broadband subscriptions remained muted, whereas mobile cellular subscriptions and individuals on the internet rose over the five years. The word frequency of the Words also increased sharply over the same period. This suggests that an increase in the use of the Words coincided with more cellular subscriptions and greater internet penetration on the ground.
5.2 The trend in the word frequency of the Words is compared to the trend in individuals on the internet (% of the population) for each year over a period of 26 years (1992 – 2017) in the following graph. It is observed that growth in individuals on the internet (% of the population) gained momentum only after 2009, and that the word count of the Words increased sharply from 2014 onwards. Both variables[3] appear to be strongly correlated: the Pearson correlation coefficient[4] is found to be 0.8, which indicates a strong positive correlation and points to a possible relationship between the two variables. It is also observed that the growth in individuals on the internet between 2014 and 2017 (3 years) is approximately the same as the growth between 2004 and 2014 (10 years). The National Democratic Alliance – I (“NDA-I”) came to power in 2014, and its policies may have accelerated the adoption of the internet in India compared to those of the United Progressive Alliance (“UPA”), which formed the government between 2004 and 2014. This suggests that NDA-I focussed heavily on increasing internet penetration in the country.
5.3 The trend in the word frequency of the Words is compared to the trend in mobile cellular subscriptions (% of the population) for each year over a period of 26 years (1992 – 2017) in the following graph. It is observed that growth in mobile cellular subscriptions (% of the population) gained momentum only from 2006 onwards. Overall, the two variables appear to be moderately correlated: the Pearson correlation coefficient is found to be 0.58, which indicates a moderate positive correlation. This may imply that the two variables are only loosely related and have limited impact on each other.
5.4 The trend in the word frequency of the Words is compared to the trend in fixed broadband subscriptions (% of the population) for each year over a period of 17 years (2001 – 2017) in the following graph. It is observed that growth in fixed broadband subscriptions (% of the population) gained momentum only from 2006 onwards. The two variables appear to be only moderately correlated: the Pearson correlation coefficient is found to be 0.49, which indicates a moderate positive correlation. This may imply that the two variables are only loosely related and have limited impact on each other.
Although the word frequency of the Words increased sharply after the NDA-I government came to power, this increase cannot be directly credited with kickstarting the adoption of the internet, mobile cellular subscriptions or broadband, since growth in all these variables had already picked up by 2009. However, it may have accelerated their adoption.
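For readers who want to reproduce the Pearson correlation coefficients quoted in 5.2 to 5.4, the calculation can be sketched as follows. This is an illustrative implementation only: the function name `pearson` and the two short series are hypothetical and are not the budget speech counts or the World Bank values used in this post.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of every observation from its series mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical yearly values, for illustration only.
words_per_year = [2, 3, 1, 10, 14]
internet_share = [10.0, 12.0, 13.0, 20.0, 25.0]
print(round(pearson(words_per_year, internet_share), 2))
```

The result always lies between -1 and 1. A high value shows that the two series move together; on its own it does not show that one causes the other.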
The following graphs focus on studying the relationship between the variables, ‘individuals on the internet’ and ‘mobile cellular subscriptions’.
5.5 The trend in the individuals on the internet (% of the population) is compared to the trend in mobile cellular subscription (% of the population) for each year over a period of 26 years (1992 – 2017) in the following graph. It is observed that both variables grew smoothly in value over the observed period of time. Both variables appear to be strongly correlated. The Pearson correlation coefficient is found to be 0.9 which indicates a strong positive correlation hence pointing at a possible relationship between the two variables. This stands to reason, because with increased access to cheaper smartphones, both cell phone subscription and internet penetration were bound to increase in the country.
5.6 Since it is intuitive that the number of people using the internet will likely increase with an increase in mobile cellular subscriptions, a scatterplot[5] of the data with a trendline is plotted below. The scatterplot supports the hypothesis that the number of individuals on the internet rises as mobile cellular subscriptions increase.
5.7 Both variables are transformed using the natural logarithm (ln) function to construct an approximately linear relationship. A linear regression model is built[6] to estimate a possible cause-action relationship between the two variables. Here, the logarithm of ‘mobile cellular subscriptions’ is treated as the independent variable (a variable that can be controlled and changed to observe its effect on a dependent variable), and the logarithm of ‘individuals on the internet’ is treated as the dependent variable (a variable that depends on the independent variable and responds to a change in its value). A scatterplot of the transformed variables with the regression line is plotted in the following graph. The residuals are also observed to follow a normal distribution[7][8].
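Using the estimates reported in footnote [6], and with the logarithm of ‘individuals on the internet’ as the dependent variable as set out above, the relationship between the transformed variables can be expressed as:
ln(individuals on the internet) = 4.2181 + 0.7175 × ln(mobile cellular subscriptions)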
In simpler terms, this indicates that a cause-action relationship between these two variables likely exists.
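The log-log regression described in 5.7 can be illustrated with a short ordinary least squares sketch. It rests on stated assumptions and is not the original analysis: the function names `fit_loglog` and `predict` are hypothetical, the input series is synthetic (generated from a known power law so the fit can be checked), and the final line plugs in the estimates from footnote [6] on the assumption that they describe ln(individuals on the internet) as a function of ln(mobile cellular subscriptions).

```python
from math import exp, log

def fit_loglog(xs, ys):
    """Ordinary least squares fit of ln(y) = a + b * ln(x); returns (a, b)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx            # slope: % change in y per 1% change in x
    a = mean_y - b * mean_x  # intercept on the log scale
    return a, b

def predict(x, a, b):
    """Back-transform the fitted log-log line to the original scale."""
    return exp(a + b * log(x))

# Synthetic series following y = 3 * x^0.5 exactly (not World Bank data).
subscriptions = [1e3, 1e4, 1e5, 1e6]
internet_users = [3.0 * s ** 0.5 for s in subscriptions]
a, b = fit_loglog(subscriptions, internet_users)
print(round(a, 4), round(b, 4))

# Residuals on the log scale; a normality test would be applied to these.
residuals = [log(u) - (a + b * log(s))
             for s, u in zip(subscriptions, internet_users)]

# Assumption: footnote [6] estimates, applied to a hypothetical input value.
print(predict(1_000_000, 4.2181, 0.7175))
```

In a log-log model the slope reads as an elasticity, so a slope of about 0.72 would mean that a 1% rise in mobile cellular subscriptions is associated with roughly a 0.72% rise in individuals on the internet.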
6. Conclusion
This analysis tells us that there might be a slight relationship between the word frequency of the Words mentioned in budget speeches and the variables used in this analysis. The analysis may also suggest that the NDA-I government accelerated the adoption of mobile phones and the internet in India more than the previous UPA government.
(Authored by Vihang Jumle, Associate, Ikigai Law with inputs from Tuhina Joshi, Policy Associate; Ratul Roshan, Policy Associate; and Anirudh Rastogi, Founder at Ikigai Law.)
[1] For the purposes of this project, the term ‘word frequency’ has been used to denote the number of times a word occurs in a particular text.
[2] A regression model is used to estimate a cause-action relationship between a dependent variable and one or more independent variables.
[3] A variable here is defined as a dummy that represents an indicator which is being compared on the graph. Here, word count and other indicators are referred to as variables.
[4] A Pearson correlation coefficient of 1 indicates a perfect positive correlation whereas -1 indicates a perfect negative correlation; values close to 0 indicate little or no linear correlation.
[5] A scatterplot is a chart that uses dots to represent the paired values of two variables.
[6] Coefficient (individuals on the internet) estimate = 0.7175, Standard error = 0.0318, p-value = 3.37e-16; Coefficient (intercept) estimate = 4.2181, Standard error = 0.57507, p-value = 3.22e-07; Residual standard error = 0.4544; Adjusted R-squared = 0.9584; F-statistic = 508.3
[7] Shapiro-Wilk normality test; p-value = 0.2616; Null hypothesis = Sample comes from a normal distribution.
[8] A residual is the difference between the estimated value and the real sample value. Normally distributed residuals suggest that the expected errors of the model do not change systematically at different levels of the dependent variable. This also suggests that the model’s accuracy remains approximately the same at different levels of the dependent variable, which makes the model more reliable for making predictions.