Instrumental Variable Analysis in Atmospheric and Aerosol Chemistry

Due to the complex nature of ambient aerosols arising from the presence of myriads of organic compounds, the chemical reactivity of a particular compound with oxidant/s are studied through chamber experiments under controlled laboratory conditions. Several confounders (RH, T, light intensity, in chamber retention time) are controlled in chamber experiments to study their effect on the chemical transformation of a reactant (exposure variable) and the outcome [kinetic rate constant determination, new product/s formation e.g., secondary organic aerosol (SOA), product/s yield, etc.]. However, under ambient atmospheric conditions, it is not possible to control for these confounders which poses a challenge in assessing the outcome/s accurately. The approach of data interpretation must include randomization for an accurate prediction of the real-world scenario. One of the ways to achieve randomization is possible by the instrumental variable analysis (IVA). In this study, the IVA analysis revealed that the average ratio of fSOC/O3 (ppb−1) was 0.0032 (95% CI: 0.0009, 0.0055) and 0.0033 (95% CI: 0.0001, 0.0065) during daytime of Diwali and Post-Diwali period. However, during rest of the study period the relationship between O3 and fSOC was found to be insignificant. Based on IVA in conjunction with the concentration-weighted trajectory (CWT), cluster analysis, and fire count imageries, causal effect of O3 on SOA formation has been inferred for the daytime when emissions from long-range transport of biomass burning influenced the receptor site. To the best of our knowledge, the IVA has been applied for the first time in this study in the field of atmospheric and aerosol chemistry.


INTRODUCTION
Instrumental variable analysis (IVA) has been utilized for understanding the causal effect of election day precipitation on electoral outcomes affecting the voters' decisions differently (Lind, 2020). IVA has also been utilized in "remote sensing" e.g., in estimating error of a geophysical product (Dong et al., 2019). IVA approach has been utilized too in "epidemiology" e.g., to provide the effectiveness of a high-dose over a standard-dose in protecting against influenza/pneumonia associated hospitalization, cardiorespiratory-associated hospitalization, and all-cause hospitalization, among others (Young-Xu et al., 2019). IVA has also been applied in the field of social science and statistics e.g., utilization of genetic markers as IV for assessing the effect of the risk factor on the outcome (von Hinke et al., 2016). However, to the best of our knowledge the application of IVA has not been done so far to assess feasibility of chemical transformations occurring in the atmosphere through gas-phase or multi-phase reactions. There are myriads of organic and inorganic compounds in the gas-phase in the ambient atmosphere (Zhang et al., 2019). Some of those compounds undergo chemical transformation owing to their reactivity with atmospheric oxidants/reactants (O 3 , H 2 O 2 , NO x, and OH radical) and may undergo a phase-change [e.g., secondary organic aerosol (SOA) formation] (Sato et al., 2013;Edwards et al., 2017). The chemical reactivity of a reactant and the formation of product/s both are influenced by several factors called confounders e.g., relative humidity (RH), temperature (T), and solar flux (SF), among others. Having measured a SOA species and an oxidant (per say O 3 ), if one attempts to find out the relationship between the two through regression analysis then the regression parameters will not be representative of the only effect of O 3 on SOA as there could also be the effect of confounders and other oxidants too. Thus, measuring reactants and products and simply relating the two does not mimic the actual effect of the reactant on the product. This drawback arises because of not achieving the randomization. The randomization makes exposure (here reactants) independent of measured/unmeasured confounders (here RH, T, SF). This can be achieved by different causal modeling approaches, one of which is the IVA (Hernán and Robins, 2020).
Broadly, there are two methods of causal modeling (Hernán and Robins, 2020). The first one is called propensity score methods that use the statistical analytics tool to make exposure independent of confounders (i.e., exchangeable). Another method uses quasi-experimental designs to achieve the exchangeability. While the propensity score method can control for measured confounders and rely on the assumption of no unmeasured confounders (are the ones that affect both exposure and the outcome, please refer to Figure 1), and no missing interaction terms, the quasi-experimental designs (e.g., IVA) can also control for unmeasured confounders, and rely on the assumption that the randomization created by the design worked. The propensity score (PS) is the predicted probability of exposure. Briefly, for a robust interpretation in the PS method one would need the data of all possible confounders to input into the model. PS calculation requires a set of methods to operate e.g., matching (nearest-neighbor matching, radius matching, and kernel matching), stratifying, adjusting for covariate in the regression model, and weighting. Finally, using the PS method the standardized mean differences are calculated for all variables that were input into the model (Schwartz et al., 2018). For continuous confounders the standardized mean difference is calculated as below: And for dichotomous confounders, the standardized mean difference is calculated using the following equation: Here, treatment and control groups represent reactive and nonreactive species. The "x mean " and "â" represent the average of the data whereas "S" represents the standard deviation. For details on PS calculation, reference is made to Hernán and Robins (2020). The critical aspect of the propensity score method arises when there are unmeasured confounders. That is, if all the confounders are not inputted into the model then the PS method could underestimate/overestimate the effect of exposure on outcome. In quasi-experimental designs, there can be four different approaches: Instrumental Variable; Regression Discontinuity; Difference in Differences; and Natural Experiments (Hernán and Robins, 2020).
Briefly, the quasi-experimental designs seek to randomize exposure independent of confounders in different ways, which are not statistical manipulations (Schwartz et al., 2017(Schwartz et al., , 2018. For details of statistical method and quasi-experimental designs, reference is made to (Hernán and Robins, 2020). IVA exploits the presence of an additional variable in the data, called an instrumental variable to achieve randomization. The instrumental variable plays the role of the random assignment.

THEORY OF IVA
Let us consider a directed acyclic graph (DAG) illustrating the instrumental variable (Z) analysis approach (Figure 1). Let A be the exposure, C be the confounders, and Y be the outcome variable. It is important to mention here that the association between Z and outcome Y is not confounded by C (measured/unmeasured confounders). Thus, by calibrating the instrument Z to A, the causal effect estimate of per unit increase in A on Y can be determined using regression analysis. To understand these terms in the context of aerosol chemistry an example could be given by considering exposure (A) by O 3 , confounders (C) by relative humidity (RH), temperature (T), and solar flux (SF), the instrumental variable (Z) by BLH or wind speed, and the outcome (Y) by the SOA fraction. Now, we shall discuss the mathematical perspectives of IVA. Let Y A=a t , Y A=a t ' be the potential outcomes at observation t under the exposures a and a'. Now, if Y A=a = Y A=a ' then it is inferred that A has a causal effect on Y. The causal estimates can be represented mathematically by two parameters as below: Summing up, under two different exposure conditions if we get different outcomes then there is a causal effect of exposure on the outcome. Let us suppose that the outcome depends on predictors in the following manner: Where η t represents the impact of all other variables on the outcome, φ 0 is intercept, and φ is the coefficient of A t . Suppose we have a variable Z such that Z is associated with Y only through A. Its association with A can be expressed as: Where, τ t represents the variations in exposure that are associated with other predictors of outcome, measured or unmeasured and σ is the coefficient of Z t . Z t σ is the variation in exposure due to the instrument, and hence independent (by assumption) of confounders. Let Z1 be the Z such that: and similarly, Here, E(A|Z1) = a represents that the expected value of exposure variable A at Z = Z1 is a, and E(A|Z2) = a' represents that the expected value of exposure variable A at Z = Z2 is a'. From Equations 5 and 7, it can be derived that: Similarly, from Equations 5 and 8, it can be derived that: Subtracting Equation 10 from Equations 9 would yield: Here, φ, which is a log rate ratio, is the causal effect of exposure on outcome in which we are interested. Summing up, first regress exposure against the instrument and then regress the outcome against instrument to get the causal effect estimate. The ratio of the two coefficients is the causal effect of a unit change in exposure. The instrument Z should explain a part of the variation in A that is not confounded by other predictors of the outcome. It is worthwhile mentioning that in IVA the confounders (measured/unmeasured) do not play any role while estimating the effect of exposure on the outcome. This is one of the major advantages of IVA over other quasiexperimental designs as often we do not have the data set of all confounders in atmospheric and aerosol chemistry research.

Hypothesis Under Examination
While assessing the effect of O 3 on SOA formation (e.g., here, f SOC = SOC/OC ratio), it would be important to notice the variability of f SOC as a function of O 3 . If the behavior of outcome (f SOC ) is linear to the predictor (O 3 ) then it is possible to build a linear regression model between the two as follows: Here, β 0 is the intercept, β 1 is the slope, and ε is the residual. A normal-distribution is observed for residuals [ε ∼ N (0, σ 2 )] in linear regression analysis (i.e., with a mean zero and variance σ 2 ). The null and alternate hypotheses for the above linear regression model are given below: Null Hypothesis(H 0 ) : β 1 = 0 Alternate Hypothesis(H a ) : β 1 = 0 Only if null hypothesis is rejected, we can get a causal estimate of O 3 on SOA formation. We seek the cases where the null hypothesis is rejected as it leads toward the causal effect estimate.

Randomized Experiments and Exchangeability
The major issue while inferring causality arises due to unmeasured parameters. Inferring causality for each data point means coming up with a valid surrogate for the unobserved counterfactual, which is hard to imagine. So, to infer causality, the randomization (exchangeability) experiment is conducted. The randomization process makes the exposure independent of confounders (measured/unmeasured). Achieving randomization means that the distribution of confounders is the same in unmeasured and measured data points. Statistically it can be understood by assuming that outcome (Y a=1 ) for the data point which happened to be measured is the same as outcome (Y a=1 ) for the data point which was not measured. After randomization, the Pr(Y = 1|A = 1) is a consistent estimate of Y a=1 (where Pr is a conditional probability of outcome under A = 1). Thus, if the probability of the outcome [or E(Y)] among measured data points is the same as the probability [or E(Y)] among the entire population then the exposures are exchangeable. The exchangeability is achieved by random assignment. Exchangeability can be achieved in several ways, one of which is the IVA.

Good Instrument for a Valid IVA Inference
There are two essential requirements for IVA inference to be valid: i. The instrument needs to explain enough variation of exposure variable A and its average causal effect on A should be non-zero. ii. The instrument must not have a direct causal effect on outcome Y.
The most important question while performing IVA in atmospheric and aerosol chemistry is like what would be a good instrument to explain the chemical transformation in a study? For example suppose a research problem focuses on the formation of SOA through oxidation of VOCs under ambient atmospheric conditions. Furthermore, the researchers are interested in finding the effect of O 3 on the SOA formation estimate. This research problem seems to be confounded by the RH, T, and solar flux (for DAG refer to Figure 1). Having understood all these backgrounds if a good instrument can be found then IVA can be performed and the causal inference on SOA formation due to per unit change in O 3 can be determined as discussed above. For most of the atmospheric and aerosol chemistry problems, the authors encourage to utilize either boundary layer height (BLH, aka mixing height) or wind speed as an instrument. It is worthwhile mentioning that BLH and wind speed affect the concentration of pollutants, but will not change the relative abundance (ratio) of any two species (e.g., outcome here, f SOC = SOC/OC). The validity of the results should be checked with p-value (level of significance) and model output confidence interval (CI: 95% or 99%). More importantly, the NULL hypothesis of a problem under the study must be checked with the CI values. The validity of the instrument needs to be checked using the machine learning algorithm. The important caveat of IVA is that under some research problems the instrument predicts changes in a subset of exposure population. In these scenarios, the instrument will provide the average treatment effect for the subset of the data and not for the entire population. This is called the local average treatment effect. This must be investigated for the problem under investigation. For example, in atmospheric science it is likely that sometimes we may get an extremely high or a very low value of exposure variable (here, O 3 ). Then it is required to consider those extreme high/low data points as outliers and re-run the model for the output.

Field-Campaign
A field campaign was conducted from 08th Nov 2015 (local time: 10:00 h) to 16th Nov 2015 (local time: 06: 00 h) at Kanpur location (26.30 • N, 80.14 • E, 142 m asl.), situated in central part of the Indo-Gangetic Plain (IGP). In this campaign, we have measured organic and elemental carbon (OC-EC), near groundlevel O 3 , and mass concentrations of PM 2.5 . Briefly, the analytical instruments were housed in Atmospheric Particle Technology Laboratory (APTL, first floor, ∼25 ft. above the ground level) at the Center for Environmental Science & Engineering (CESE) building in Indian Institute of Technology Kanpur (IITK) premises. The entire data set has been sub-divided into three periods: Pre-Diwali (08th to 10th Nov.), Diwali festival (11th Nov.), and Post-Diwali (12th to 16th Nov.). The OC-EC has been measured using a Sunset laboratory semi-continuous Carbon analyzer using NIOSH (National Institute for Occupational Safety and Health) protocol (Birch and Cary, 1996;Rajput et al., 2017). O 3 has been measured on an ozone analyzer (Thermo Scientific; Model # 49i) by UV-photometric technique. Particles number concentrations (PNC) were measured in the aerodynamic size-range of 0.25-2.5 microns on an Aerosol Spectrometer (PALAS, Germany). Assuming spherical shape and apparent density (1.1 g cm −3 ), the PNC (#/cm 3 ) was converted into mass concentration (µg m −3 ) following the approach given in earlier publications (Pitz et al., 2003;Izhar et al., 2018;Rajput et al., 2019). These instruments were kept nearby on a platform in the lab, and their inlets (separated by <1 m) were drawing air from the ambient atmosphere through a window located in the rear-side of the CESE building. Details on instrumentation and analysis are given in the Supplementary Material.
In this study, the secondary organic carbon (SOC) was estimated by the minimum OC/EC ratio method using the below-given equations (Castro et al., 1999;Ram et al., 2008;Rajput et al., 2018).
Here, OC prim refers to primary organic carbon whereas (OC/EC) min refers to the minimum OC/EC ratio. The minimum OC/EC ratio was 3.2, 3.4, and 5.0 during Pre-Diwali, Diwali, and Post-Diwali periods, respectively. The SOC/OC ratio has been referred to as f SOC (fraction of SOC in total OC) and POC/OC has been referred to as f POC (fraction of POC in total OC) in the subsequent discussion.

Retrieval of the Meteorological Data Set
Meteorological parameters viz. ambient temperature (T), wind, and relative humidity (RH) have been retrieved from the oncampus weather station. Boundary layer height (BLH), solar flux (SF), and 5-days air-mass back trajectories (AMBTs, GDAS 0.5 • , @ 1,000 m above ground level) data have been retrieved from NOAA HYSPLIT (Hybrid Single-Particle Lagrangian Integrated Trajectory) model (Draxler and Rolph, 2003;Stein et al., 2015). AMBTs have been coupled with PM 2.5 mass concentration data for cluster and concentration-weighted trajectory (CWT) analysis (discussed in the subsequent section).

Data Processing in R
Data analysis, statistical analysis, instrumental variable analysis (IVA), and machine learning coding have been carried out in R (Carslaw and Ropkins, 2012;Schwartz et al., 2017Schwartz et al., , 2018Hernán and Robins, 2020). The OC/EC ratio varied from 3.2 to 18.5 during the entire study period. The daytime and nighttime average values of OC/EC has been constrained from linear regression analysis (Figure 3). The OC/EC ratio averaged at 4.4 ± 0.2 during the Pre-Diwali daytime period (08th−10th Nov) and was found statistically different as compared to that observed during Pre-Diwali nighttime period (Avg. ± SD = 6.5 ± 1.3; t-value = 11.7, Table 1, Figure 3). Likewise, on Diwali festival event (11th Nov) the daytime (5.1 ± 0.6) and nighttime average values (6.6 ± 1.2) of OC/EC ratio were statistically different (t-value = 5.2). Furthermore, during the Post-Diwali period (12th−16th Nov) the daytime average ratio of OC/EC (9.3 ± 1.5, Figure 3) also looks statistically different as compared to that observed at nighttime (11.0 ± 1.6, t-value = 7.7). Summing up, PM 2.5 , OC, and OC/EC ratio were highest during the Post-Diwali period and followed the following trend: Pre-Diwali < Diwali < Post-Diwali. However, EC varied a little during the study period. The higher OC/EC ratio (9-11), during Post-Diwali period plausibly indicated for their predominant emission from biomass burning and/or SOA formation (Rajput et al., 2013. To

05) value for variables determined through linear regression analysis. Statistical two-tailed t-test showing significant difference (p < 0.05) in the compared values is shown in bold (t-values). Non-significant difference (p > 0.05) in the compared values are shown in regular font (t-value). a Data measured in this study. b Data retrieved from on-campus (@IIT Kanpur) weather station. c Data retrieved from NOAA.
further collect the evidence about this we have carried out concentration weighted trajectory and cluster analysis. We have also retrieved daily fire count imageries for the entire study period from Moderate Resolution Imaging Spectroradiometer sensor (MODIS, onboard NASA Terra and Aqua satellite). We have discussed striking observations revealed from these evidence in subsequent sections.

Concentration Weighted Trajectory (CWT) and Cluster Analysis
We have used an open-source GIS-based software viz. TrajStat to carry out the CWT and Cluster Analysis (Figure 4). The application details about CWT and cluster analysis can be found elsewhere (Wang et al., 2009). Briefly, CWT is a receptor-based model and is widely used in conjunction with the cluster analysis to understand the ambient level of air pollutants along with the cluster of trajectories influencing the receptor site (Bansal et al., 2019). Using the CWT and cluster analysis we have attempted to show local/regional source-areas of PM 2.5 potentially affecting the receptor site. To perform CWT and cluster analysis over the study region, a 5-days air-mass back trajectories (AMBTs, GDAS 0.5 • m, @ 1,000 m above ground level) data are coupled with PM 2.5 mass concentration data. In the CWT method, each grid cell is assigned a weighted concentration by averaging the sample concentrations based on associated trajectories crossing that grid cell, using the mathematical formula as given below: where C ij is the average weighted concentration of PM 2.5 in the grid cell (i, j), M is the total number of trajectories, l is the index of the trajectory, C l is the concentration observed at receptor site on the arrival of trajectory l, and τ ijl is the trajectory residencetime (time spent) of trajectory l in the grid cell (i,j). A higher value of C ij implies that trajectories over the grid cell (i, j) would be associated with higher concentrations. In the cluster analysis method, the measured PM 2.5 mass concentrations were assigned to the corresponding trajectories and the nearest trajectories were clustered according to an angle distance function (Sirois and Bottenheim, 1995). One of the observations from Figure 4 relates that the PM 2.5 levels were relatively high in the north-west (NW) direction of the receptor site (study location, central IGP). Thus, it is expected that if a trajectory cluster is passing more frequently through NW-direction to the receptor site then it will be associated with elevated levels of PM 2.5 (and associated species) as compared to the case when trajectories were arriving from other directions.
Let us now analyze the clusters shown in Figure 4. During Pre-Diwali period (Figure 4A), three sets of clusters have been observed: cluster-I (65%) originated from the west and then traversed through SE to the site, cluster-II (24%) also originated from the west and then traversed through SE to the site but all the time remained closure to the grid-cell of the receptor site, and cluster-III (11%) originated from the west and then traversed through NW to the site. During Diwali (Figure 4B), only two sets of clusters have been observed: cluster-I (54%) showing impact from north-westerly air-masses at the site, whereas cluster-II (46%) originated from the south and then traversed through NW. Thus, based on cluster analysis in conjunction with CWT, relatively high concentrations of PM 2.5 (and associated OC, and OC/EC ratio) during Diwali as compared to those during Pre-Diwali can be explained to more influence of north-westerly polluted air-masses during the Diwali event. During the Post-Diwali period (Figure 4C) also, only two sets of clusters were observed: cluster-I (81%) showing impact from north-westerly air-masses at the site, whereas cluster-II (19%) showed the impact of westerly air masses. Thus, it can be summarized that due to the predominant impact of north-westerly air-masses during the Post-Diwali period the concentrations of PM 2.5 , OC (one of the major fractions of PM 2.5 ) and OC/EC ratio were higher followed by those on Diwali and then on Pre-Diwali period.
To further analyze the plausible reason for showing the elevated levels of air pollutants while air-masses arriving through NW-direction we have retrieved fire count imageries which are discussed below.

Fire Count Imageries Over Indo-Gangetic Plain (IGP)
Moderate Resolution Imaging Spectroradiometer sensor (MODIS, onboard NASA Terra and Aqua satellite) imageries capturing fire count data (spatial resolution: 10 km) from intense open biomass (agricultural-waste paddy-residue) burning over IGP during the study period are shown in Figure 5. Looking at the fire counts data it reveals that agricultural-residue fire activity was relatively intense on 09th, 11th, 12th, 14th, and 16th of Nov, moderate on 10th, 13th, and 15th of Nov, and low on 08th of Nov ( Figure 5). Furthermore, it is obvious from the figure that open biomass burning was active in the upwind IGP region (mainly in north-west direction of the sampling site) during the entire study period (08th to 16th November 2015). This observation is quite consistent with earlier observations reporting massive emissions of air pollutants from agricultural-waste burning in upwind IGP (Rajput et al., 2014. It is worthwhile mentioning that 100s of million tons of paddy-residues are burned every year for the crop-rotation during October-November months in the north-western region of IGP (Rajput et al., 2014). Summing up, clubbing the data set from our campaign, satellite observations on fire counts along with the CWT and cluster analysis it can be summarized that long-range transport of air masses (biomass burning emissions) from NW-direction (upwind IGP) was responsible for elevated levels of air pollutants at the receptor site mainly during Post-Diwali and Diwali period. Before we

Research Background on SOA Formation
Chemical transformations of volatile organic compounds (VOCs) can result in the formation of secondary organic aerosols (SOA) and photochemical smog (Volkamer et al., 2006). SOA formation can broadly occur via two pathways: gas-or aqueous-phase oxidation/nitration of VOCs. The oxidation reactions of VOCs can take place with OH radicals, O 3 , and H 2 O 2 whereas nitration reactions take place with NO 3 radicals. Several studies have compared oxidation/nitration for O 3 , OH, and NO 3 radicals. For example, daytime SOA formation through gas-phase oxidation of VOCs (e.g., isoprene, monoterpene) has been reported dominantly by OH radicals and O 3 (Sato et al., 2013;Edwards et al., 2017). However, nighttime dominant SOA formation through gas-phase reactions has been reported by O 3 and NO 3 radical (Sato et al., 2013;Edwards et al., 2017). Furthermore, NO X and O 3 react with C=C type moieties whereas OH reacts with compounds containing C=C and other saturated though reactive compounds (Palm et al., 2017). Moreover, another oxidizing agent, which oxidizes VOCs efficiently both during the day and nighttime is H 2 O 2 . A recent study focusing on aqueous-phase SOA led by H 2 O 2 during the day (UV aging) and nighttime (dark aging) have brought a new set of information: daytime aqueous-phase oxidation product contains large water clusters [(H 2 O) n OH, n = 14-43], heavier oligomers (m/z: 329-383) whereas nighttime aqueous-phase oxidation products contain small water clusters [(H 2 O) n OH, n = 1-13], small oligomers (m/z: 33-261), and cluster ions (m/z: 333-405) (Zhang et al., 2019). Summing up, the oxidizing agents can led to SOA formation through different pathways. However, the important aspect to mention here is that O 3 has been reported by several aforementioned studies to serve as a potential oxidizing agent both during the day and nighttime.
Let us now discuss regional observations based on several previous studies from Indo-Gangetic Plain (IGP). The IGP experiences high-to-severe air pollution during post-monsoon (Oct-Nov) through winter (Dec-Feb). During the postmonsoon, urban background air pollution in central IGP is regulated mainly by vehicular emission, industrial emission, coal-based power plants, bio-fuel burning, and uplifted mineral dust (Rajput et al., 2016;. Furthermore, long-range transport of pollutants from source-region (in upwind IGP) of seasonally active post-harvest agricultural-waste burning emissions of paddy-residues makes the air quality further poor in central and downwind locations (Rajput et al., 2014). There have been several attempts on measurements and modeling of various air pollutants, secondary organic aerosol (SOA) formation, and radiative forcing estimates across IGP, e.g., (Ram et al., 2008;Srivastava and Ramachandran, 2013;Rajput et al., 2018;Dumka et al., 2019;Srivastava et al., 2019;Mhawish et al., 2020;Mishra and Kulshrestha, 2020;Satish et al., 2020). Association between pollutants has been reported from the above studies in particular and other studies conducted across the globe in general. However, the causation of an exposure variable on an outcome in atmospheric and aerosol chemistry space has never been studied before (to the best of our knowledge).
Diurnal Variability of f SOC , f POC , O 3 , BLH, RH, and T The diurnal variability of the fraction of SOC and POC in total OC represented as f SOC and f POC , respectively are shown in Figures 6A-C for Pre-Diwali, Diwali, and Post-Diwali periods. Accordingly, during Pre-Diwali period the average f SOC was relatively high during nighttime (0.39 ± 0.15, t-value = 2.0) as compared to that in daytime (0.34 ± 0.11, Figure 6A). During Diwali, daytime average value of f SOC (0.33 ± 0.07) was higher than that in nighttime (0.25 ± 0.12, t-value = 2.7). Similarly, during Post-Diwali, daytime average value of f SOC (0.44 ± 0.12) was also higher than that in nighttime (0.30 ± 0.07, t-value = 10.6). Thus, unlike the Pre-Diwali period, the average f SOC was higher in the daytime as compared to that in nighttime during Diwali and Post-Diwali periods (Figures 6B,C). We reiterate that the impact of long-range transport of pollutants from postharvest biomass burning emissions were observed during Diwali and Post-Diwali period. Thus, it is logical to infer that under the impact of biomass burning emissions, the f SOC value was observed higher in the daytime as compared to that in the nighttime. The pattern of f POC is also shown in Figure 6 but since it is trivial to estimate f POC (= 1-f SOC ) we do not discuss here about it in the detail.
We have also shown the diurnal variability of O 3 , BLH, RH, and temperature (T) during the study period (Figure 6).
The O 3 has a daytime peak during Pre-Diwali whereas during Diwali and Post-Diwali periods it exhibited an additional peak in the nighttime. The nighttime O 3 peak is attributable to the long-range transport of biomass burning emissions from NWdirection (upwind IGP). Furthermore, RH and T retained their variability pattern during the entire study period. Moreover, maximum BLH during Pre-Diwali (∼2,700 m, asl.), and Diwali (2,500 m asl.) were relatively high as compared to maximum BLH recorded during Post-Diwali (1,500 m, asl.). However, the diurnal variability pattern of BLH looks similar during the entire study period. Before we move to the next aspect, there are two important observations to notice for (from Figure 6): BLH variability appears to have an insignificant impact on f SOC pattern, and BLH explains, if not all, at least some of the variability of O 3 . This is exactly the scenario we were referring to in the above section wherein we discussed the validity of BLH to serve as IV to infer the causal effect of O 3 (exposure) on f SOC (outcome). Before we discuss the results from IVA analysis (all relevant equations given above), let us first look at the collinearity between f SOC and O 3 using linear regression analysis.

Correlation Analyses: O 3 vs. f SOC (= SOC/OC Ratio)
A 3-D correlation plot is shown in Figure 7 for three events (Pre-Diwali, Diwali, and Post-Diwali) and two periods (daylight and nighttime). O 3 (ppb) is plotted on the X-axis, f SOC on the Yaxis, and the BLH data set is plotted on the z-axis (m, shown on a scale at the right hand side of the plot). The BLH appears to have an insignificant influence on the relationship between f SOC and O 3 . This further relates to our aforementioned observation on insignificant co-variability of f SOC as a function of BLH. For the linear regression analysis as shown in Figure 7, the level of significance was found to be quite satisfactory (i.e., p < 0.05) only for two cases: daytime of Diwali and Post-Diwali period. During Diwali daytime period, the ratio of f SOC /O 3 (ppb −1 ) averaged at 0.0036 ± 0.0015 (R 2 = 0.79, p < 0.05, Table 1, Figure 7). Furthermore, during the Post-Diwali period, the average ratio of f SOC /O 3 (ppb −1 ) was 0.0031 ± 0.0017 (R 2 = 0.42, p < 0.05, Table 1, Figure 7). Thus, during the daytime period marked with long-range transport of pollutants from biomass burning emission the association between f SOC and O 3 was found to be quite significant in this study. However, based on linear regression analysis, we cannot attribute that how much SOA (f SOC ) has been formed due to the O 3 during daytime of Diwali and Post-Diwali period? So, the next step would be to apply causal modeling to gain further insights on this aspect.

The Causal Effect of O 3 on SOA Formation
We have performed IVA analysis to assess the causal effect of O 3 on SOA formation in IGP. We have used a support vector machine (SVM, R package e1071) (Cortes and Vapnik, 1995;Team, 2013) with a radial kernel to estimate variation in O 3 (exposure) explained by the BLH (instrument). The SVM algorithm maximizes a 10-fold cross-validation. We have checked R 2 of the instrument (BLH) predicting exposure (O 3 ) to ensure that our instrument was strongly associated with the exposure. In this process of validation, for all the six cases (daytime and nighttime of Pre-Diwali, Diwali, and Post-Diwali) the R 2 varied between 0.84 and 0.93, thus suggesting BLH to serve as a good instrument to estimate the causal effect of O 3 on SOA formation. We have already discussed above in detail about the mathematical modeling of IVA to infer causal effect of O 3 on SOA formation (i.e., f SOC ). The results of the IVA analysis are shown in Table 2. Before we discuss the results, let us recall that only if the Null hypothesis is rejected (i.e., β 1 = 0), we can get a causal estimate of O 3 on SOA formation. Now, if we look at the 95% confidence interval values (CI, Table 2) then it can be noticed that only for the two cases i.e., daytime of Diwali and Post-Diwali period, the null hypothesis is rejected (i.e., β 1 = 0). Thus, causal estimate can be obtained for the aforementioned two cases only during the study period. The IVA analysis revealed that during Diwali daytime period, the average ratio of f SOC /O 3 (ppb −1 ) was 0.0032 (95% CI: 0.0009, 0.0055, p < 0.05, Table 2). Furthermore, during the Post-Diwali daytime period, the average ratio of f SOC /O 3 (ppb −1 ) was 0.0033 (95% CI: 0.0001, 0.0065, p < 0.05, Table 2). Thus, based on IVA it is inferred that O 3 has a causal effect on SOA formation during daytime while experiencing massive emissions from long-range transport of biomass burning emissions at the receptor site.

CONCLUSIONS
This study finds and discusses the application of instrumental variable analysis (IVA) for inferring causality in atmospheric and aerosol chemistry observations. The basic methodology of IVA has been explained herein through a set of equations and theories involved concisely. IVA is a quasi-experimental design that can control for measured as well as unmeasured confounders. BLH and wind speed have been suggested to serve as good instruments for inferring causality in atmospheric chemical transformations. The importance of randomized experiments to achieve exchangeability has been explained in terms of obtaining the same distribution of confounders in measured and unmeasured data points. The machine learning algorithm has been used to assess the validity of the instrument. On field-based measurements, the IVA has been applied to get a causal estimate of the outcome (e.g., SOA formation) due to the exposure (O 3 ). Accordingly, the average ratio of f SOC /O 3 (ppb −1 ) was found to be 0.0032 (95% CI: 0.0009, 0.0055) and 0.0033 (95% CI: 0.0001, 0.0065) during daytime of Diwali and Post-Diwali period. However, during rest of the study period the relationship between O 3 and f SOC was not significant. Inferences from IVA coupled with the CWT, cluster analysis, and fire count imageries revealed that O 3 has a causal effect on SOA formation during the daytime when emissions from biomass burning via long-range transport influenced the receptor site. The programming of IVA can be done using R, python, or any other language-based software.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/Supplementary Materials, further inquiries can be directed to the corresponding author/s.

AUTHOR CONTRIBUTIONS
IVA has been performed by PR. The manuscript has been conceptualized and written by PR. The field-campaign was planned and organized by PR and TG. All authors contributed to the article and approved the submitted version.