TAX DATA ANALYTICS; relational databases and ERPs in a tax setting/Modeling
Tax Modeling
• Tax modeling is a subset of financial modeling, where tax
plays an important (if not THE most important) role.
“Financial modeling is the task of building an abstract
representation (a model) of a real world financial
situation. This is a mathematical model designed to
represent (a simplified version of) the performance of a
financial asset or portfolio of a business, project, or any
other investment.
Modeling
• Typically, then, financial modeling is understood to mean
an exercise in either asset pricing or corporate finance, of
a quantitative nature. It is about translating a set of
hypotheses about the behavior of markets or agents into
numerical predictions. At the same time, "financial
modeling" is a general term that means different things to
different users; the reference usually relates either to
accounting and corporate finance applications or to
quantitative finance applications
Examples of financial modeling
• Business valuation, especially discounted cash flow, but
including other valuation approaches
• Scenario planning and management decision making
("what is"; "what if"; "what has to be done")
• Capital budgeting, including cost of capital (i.e. WACC)
calculations
• Financial statement analysis / ratio analysis (including of
operating- and finance leases, and R&D)
• Revenue related: forecasting, analysis
Examples of financial modeling
• Project finance modeling
• Cash flow forecasting
• Credit decisioning: Credit analysis and Consumer credit
risk; impairment- and provision-modelling
• Working capital- and treasury management; asset and
liability management
• Management accounting: Activity-based costing,
Profitability analysis, Cost analysis”
Tax modeling
• Tax modeling is frequently focused on
future financial projections where tax is an
important aspect. Such modeling can occur
with or without uncertainty built in.
Examples of Tax modeling
For individual tax, common examples include:
• choosing filing status (married filing jointly versus married filing separately)
• investing in a traditional IRA versus a Roth IRA (see the sketch after these examples)
• investing in taxable versus tax-free bonds
• sell versus hold decisions for stock
Examples of tax modeling
(individuals)
• tax impacts of making charitable
contributions of property
• decisions on exercising and selling stock
options
• decisions assuming future changes in tax
rates or rules (e.g., taking a deduction now
versus later)
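As a simple illustration of the traditional-versus-Roth IRA decision above, here is a minimal Python sketch. It assumes a single contribution, a flat annual return, and flat tax rates; the function names, rates, and horizon are hypothetical, not a recommendation.

```python
# Hypothetical sketch: after-tax value of a traditional IRA versus a Roth IRA.
# All inputs (contribution, return, horizon, tax rates) are assumptions for illustration.

def traditional_ira_after_tax(contribution, annual_return, years, tax_rate_at_withdrawal):
    """Pre-tax contribution grows tax-deferred; the full balance is taxed at withdrawal."""
    balance = contribution * (1 + annual_return) ** years
    return balance * (1 - tax_rate_at_withdrawal)

def roth_ira_after_tax(contribution, annual_return, years, tax_rate_today):
    """Contribution is made with after-tax dollars; growth and withdrawal are tax-free."""
    after_tax_contribution = contribution * (1 - tax_rate_today)
    return after_tax_contribution * (1 + annual_return) ** years

if __name__ == "__main__":
    trad = traditional_ira_after_tax(7_000, 0.06, 30, tax_rate_at_withdrawal=0.22)
    roth = roth_ira_after_tax(7_000, 0.06, 30, tax_rate_today=0.32)
    print(f"Traditional (taxed later): {trad:,.0f}")
    print(f"Roth (taxed now):          {roth:,.0f}")
```

The comparison turns largely on whether the taxpayer's marginal rate today is higher or lower than the expected rate at withdrawal; uncertainty can be layered on by rerunning the same functions over a range of assumed rates.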
Examples of business tax
modeling
• tax impacts of Sec 754 elections
(partnerships)
• *tax impacts of special allocations decisions
(partnerships)
• *payment of salary (closely held
corporations)
• *capital structure—debt versus equity
Examples of business tax modeling (corporations)
• CAPEX decisions
• M&A—electing Sec 338 or not; using tax
free reorganization rules (or not)
• decisions assuming future changes in tax rates or rules (e.g., taking a deduction now versus later; see the sketch after this list)
• *putting an overseas affiliate into an entity
(versus keeping it as a branch)
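As a rough illustration of the deduction-timing decision flagged above, here is a minimal present-value sketch in Python. The deduction amount, tax rates, deferral period, and discount rate are all hypothetical assumptions.

```python
# Hypothetical sketch: present value of taking a deduction now versus deferring it
# when the statutory tax rate is expected to change. Amounts and rates are assumptions.

def pv_of_deduction(deduction, tax_rate, years_deferred, discount_rate):
    """Tax savings = deduction x rate, discounted back from the year it is claimed."""
    tax_savings = deduction * tax_rate
    return tax_savings / (1 + discount_rate) ** years_deferred

deduct_now   = pv_of_deduction(1_000_000, tax_rate=0.21, years_deferred=0, discount_rate=0.08)
deduct_later = pv_of_deduction(1_000_000, tax_rate=0.28, years_deferred=3, discount_rate=0.08)

print(f"PV of deducting now:   {deduct_now:,.0f}")
print(f"PV of deducting later: {deduct_later:,.0f}")
```

A higher assumed future rate can make deferral attractive despite discounting; the model simply makes that trade-off explicit.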
********
DATA ANALYTICS PAPERS:
Deloitte's "US tax data management"
Deloitte's "US tax data analytics - a new era for tax planning and compliance"
KPMG's "Tax Data and Analytics"
These three papers, taken together, represent the "Big Four" roadmap for the modern tax department. While they use slightly different terminology, they all argue that the tax profession is undergoing a fundamental shift from a back-office compliance function to a front-office strategic powerhouse.
Here is one unified summary of the core themes across all three papers:
1. The "Data-Ready" Infrastructure (Management)
Both Deloitte and KPMG emphasize that you cannot have meaningful analytics without a "Tax Data Model."
The Problem: Most companies have ERP systems (SAP, Oracle) that are designed for accounting, not tax. This results in "dirty data" that requires massive manual effort to clean.
The Solution: Both firms advocate for "Tax-Sensitizing" data—building automated pipelines that extract raw data and immediately apply tax logic (e.g., identifying deductible vs. non-deductible meals or calculating R&D credits) before it ever hits a spreadsheet (a sketch of this idea follows below).
Takeaway: Tax data management is the "plumbing" that makes high-level AI and analytics possible.
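As a minimal sketch of the "tax-sensitizing" idea, the snippet below tags general-ledger lines with simplified tax logic at extraction time. The column names, account numbers, and the 50% meals assumption are illustrative only, not the firms' actual methodology.

```python
# Minimal sketch of "tax-sensitizing" raw ERP data: tag general-ledger lines with tax
# logic as they are extracted, before anyone opens a spreadsheet. Column names,
# account numbers, and the 50% meals rule applied here are illustrative assumptions.
import pandas as pd

gl = pd.DataFrame({
    "account": [6400, 6400, 7200, 6800],
    "description": ["Client dinner", "Holiday party", "Lobbying fees", "R&D supplies"],
    "amount": [250.0, 4_000.0, 10_000.0, 55_000.0],
})

def tax_sensitize(row):
    # Simplified rules of thumb for illustration only; real logic would be far richer.
    if row["account"] == 6400:          # meals & entertainment account (assumed)
        return row["amount"] * 0.50     # assume 50% deductible meals
    if row["account"] == 7200:          # lobbying (assumed account)
        return 0.0                      # nondeductible
    return row["amount"]                # fully deductible by default

gl["deductible_amount"] = gl.apply(tax_sensitize, axis=1)
print(gl)
```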
2. The Evolution of Strategy (Analytics)
Deloitte’s "New Era" and KPMG’s "D&A" paper both describe a three-stage evolution of the tax department:
Descriptive (Hindsight): Using data to see what happened (e.g., "What was our Effective Tax Rate last year?").
Diagnostic (Insight): Using dashboards to see why things happened and identifying anomalies (e.g., "Why is our VAT leakage so high in the EMEA region?").
Predictive (Foresight): This is the "New Era." Instead of just reporting the past, tax departments use AI to model the future. For example, if the company changes its supply chain or if a new tax law is passed, the tax team can immediately model the multi-million dollar impact on the bottom line.
3. The "Chief Data Officer for Tax" (The Human Factor)
A unique point emphasized across these papers is the changing profile of the tax professional.
Hybrid Skills: The papers argue that the next generation of tax leaders must be "bilingual"—fluent in both tax law and data science.
Strategic Seat at the Table: By automating the "boring" 80% of the job (data gathering and basic compliance), the tax team is freed to participate in M&A discussions, supply chain planning, and ESG reporting.
KPMG's "Data by Design": KPMG specifically suggests that organizations should appoint a "Chief Data Officer for Tax" to ensure that as the company builds new IT systems, tax requirements are built into the code from Day 1.
4. Catching Up to the Tax Authorities
A major driver for all these changes is that Tax Authorities (IRS, HMRC, OECD) are already using AI.
Governments are moving toward real-time reporting. If the government’s AI finds an error in your data before your team does, you face massive audits and penalties.
The papers agree: Corporations must adopt these tools just to "keep pace" with the increasingly sophisticated digital scrutiny of global regulators.
Final Bottom Line
If you merge these papers into one message, it is this: Tax is no longer a math problem; it is a data problem. Success in the next decade depends on moving away from manual spreadsheets and toward an automated, "tax-sensitized" data environment where AI provides real-time strategic foresight.
Data Warehousing, ERP, and Tax
Module 4
Professor Chuck Swenson
How Does This Affect You?
• To do tax returns, need to get client data
• To do tax consulting projects, need to get
client data
• Clients will usually not allow you to directly
access their data. Instead, you request it
• This module deals with what types of data
there are, where they come from, and using
them
• Example
• You are doing state corporate returns for a client. The client has multiple divisions, each using different data warehouses
• You get info for each division in different
formats. What info? From where?
• You clean, format in the same manner, and
combine the data to do the return
• 2nd Example
• You do a nexus study for a client. Again, the client
has different divisions with different data
warehouses and data is in different formats
• You need to figure out and ask what type of data
you want. Data from where?
• Again, you need to format, clean, and combine
Nomenclature
• Digital Assets- Calculators, including programs provided by vendors and internal customized Excel workpapers. Also, know-how to link existing programs into an integrated system.
• Data Script- Data requirements needed to load a system or subsystem to
perform desired measurement and reporting/analytics.
• Data Library- A set of Data Scripts that are “interwoven” such that data is used
for multiple tasks.
• Data Push- IT controls when and how data is made available.
• Data Pull-User is granted right to pull required data from the ERP and other
systems as needed.
• Business Requirements-The definition of the business problem to be addressed.
• Technical Requirements-The definition of the tax technical and calculator
requirements that underpin the Business Requirements.
• Data Foundation-the underlying ability of ERP and other systems to provide the
data requirements per the Data Script.
• Data Gaps-The inability to obtain required data from any source.
• Transformation-The use of various techniques to use available data as a basis for providing an estimate in lieu of the actual data.
• SIPOC- An acronym for Suppliers-Inputs-Process-Outputs-Customers. The steps to collect the business requirements.
• Heat Map- A visual to provide the user with a quick reference to any specific
result in terms of magnitude.
• Benchmarking-a comparison with another population to provide insight via
similarity and divergence
• Trend Analysis-looking at variation to forecast the future.
• Data Cubes-the underlying manner in which discrete data
items are stored in an ERP and other systems.
• Flat File-A file for organizing and storing data for use by a calculator (flat files can be read by Alteryx, SQL, and Excel)
• Data Integration-using data from multiple sources in
combination per the technical requirements.
• Data Warehouse – A system used for reporting and data analysis, and considered a core component of business intelligence
• ERP – An acronym for Enterprise Resource Planning-
integrated management of main business processes, often
in real time and mediated by software and technology
Ideal System for Multinational
Enterprise (MNE)
One integrated system for all subsidiaries,
divisions, and countries—see next slide (from
Deloitte article)
What Often Happens...
• Separate ERPs and data warehouses for divisions,
entities, countries
• ERPs may be different; warehoused data in
different forms
• Differences due to inertia, or M&A activity
• Tax professional requests data from multiple
systems to do tax compliance or consulting projects
• Tax professional integrates data with Alteryx, SQL, Excel, or other software
SIPOC
• For client’s data processes
• For consultant’s compliance and consulting
processes
• Potentially use in conjunction with Six Sigma
• SIPOC Youtube example:
https://www.youtube.com/watch?v=j9oWnnbtkOo
• Six Sigma Youtube example:
https://www.youtube.com/watch?v=GiCMGL7RqRw
Tools for Integrating and Cleaning Data
• SQL—relational database tool
• Excel
• Alteryx
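A minimal pandas sketch of the clean-format-combine workflow described in the examples earlier in this module (the file names, column names, and schema mapping are hypothetical):

```python
# Sketch of integrating division extracts that arrive in different formats.
# File names, column names, and mappings are hypothetical.
import pandas as pd

div_a = pd.read_csv("division_a_sales.csv")       # e.g., columns: State, Sales
div_b = pd.read_excel("division_b_sales.xlsx")    # e.g., columns: state_code, revenue

# Clean: rename to a common schema, standardize state codes, coerce numeric types.
div_a = div_a.rename(columns={"State": "state", "Sales": "sales"})
div_b = div_b.rename(columns={"state_code": "state", "revenue": "sales"})
combined = pd.concat([div_a, div_b], ignore_index=True)
combined["state"] = combined["state"].str.upper().str.strip()
combined["sales"] = pd.to_numeric(combined["sales"], errors="coerce")

# Combine: one apportionment-ready sales-by-state table for the return workpapers.
sales_by_state = combined.groupby("state", as_index=False)["sales"].sum()
print(sales_by_state)
```

The same pattern carries over to Alteryx or SQL; the essential steps are a common schema, standardized keys, and a single combined table.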
Example Tax Projects
• Permanent establishment (PE) study—next
half of our class
• Sec 263A study--UNICAP
• Transfer pricing study (international)
• R&D analysis for tax credit
• Sales/use tax study; reverse audit
• VAT analysis—overpay or underpay?
**************
AI/RPA/ML/Big Data:
ALSO SEE PAPER ON MY ICLOUD DESKTOP CALLED “AI IN TAX” by Ying-Ying Su - KPMG
Robotic Process Automation (RPA)
RPA is a technology that uses software "robots" (bots) to mimic human actions on a computer. Unlike traditional automation, which often requires complex back-end coding and APIs, RPA works on the User Interface (UI)—literally clicking buttons, typing text, and moving files just as a human would.
1. What is RPA?
- At its core, RPA acts as the "hands" of a digital worker. It is best suited for tasks that are:
- Highly Repetitive: Tasks performed frequently (e.g., daily data entry).
- Rule-Based: Clear "if-then" logic with no room for subjective judgment.
- Digital: The data must be available in a digital format (Excel, Web, ERP).
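To make "rule-based" concrete, here is a plain-Python sketch of the kind of task RPA targets. Real RPA tools such as UiPath automate at the user-interface level with low-code recorders rather than scripts; the file name, columns, and approval threshold here are hypothetical.

```python
# Illustrative only: a repetitive, rule-based, digital task of the kind RPA targets.
# (Real RPA tools such as UiPath drive the user interface itself; this plain-Python
# sketch just shows the "if-then with no judgment" character of the work.)
import csv

def route_invoice(row):
    amount = float(row["amount"])
    if row["po_number"].strip() == "":
        return "exception_queue"        # missing PO: needs a human
    if amount > 10_000:
        return "manager_approval"       # threshold is an assumed business rule
    return "auto_post"

with open("invoices.csv", newline="") as f:        # hypothetical input file
    for row in csv.DictReader(f):
        print(row["invoice_id"], "->", route_invoice(row))
```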
Key Benefits
- Accuracy: Bots don't get tired and don't make typos.
- Speed: They work 24/7 at speeds much faster than a human.
- Legacy Integration: RPA can "talk" to old software that doesn't have modern connection points (APIs).
2. How UiPath Works
UiPath is the market leader in RPA. It works through a three-part ecosystem known as the Studio-Robot-Orchestrator model. You can think of it like a theater production:
A. UiPath Studio (The Director)
- This is the development environment where you "teach" the bot what to do.
- Low-Code: You use a visual "drag-and-drop" interface to build a flowchart of the process.
- Recording: You can hit "Record," perform a task yourself, and the Studio will automatically generate the steps for the bot.
B. UiPath Robot (The Actor)
- This is the software agent that actually executes the instructions created in the Studio. There are two types:
- Attended Bots: These live on your PC and help you with parts of your job (e.g., you click a button to have the bot fill out a long form for you).
- Unattended Bots: These live on a server and work completely on their own, usually triggered by a schedule or an event (e.g., a bot that wakes up at 2 AM to process all the previous day’s invoices).
C. UiPath Orchestrator (The Stage Manager)
- This is a web-based "control tower" that manages your entire fleet of bots.
- Monitoring: It tracks which bots are busy, which are free, and if any encountered errors.
- Scheduling: It tells the bots exactly when to start their tasks.
- Security: It securely stores passwords and credentials so the bots can log into systems without humans seeing the sensitive data.
*******
Use of AI in Tax:
Part 1: The Future of Tax (Sources 1, 2, & 4)
- OECD "Tax Administration 3.0" (2020): This is a landmark report proposing a shift from "digitized" tax (putting paper forms online) to "digital transformation." The vision is a seamless system where tax processes are built into the "natural systems" businesses use (like accounting software and e-commerce platforms). Instead of filing a return, tax is calculated and paid in real-time as transactions occur.
- Analytical Challenges (Butler, 2020): This paper highlights that while technology is advancing, the "human side" of tax administration is struggling. The major challenge is the gap between tax law and data science. It argues that for AI to work in tax, agencies must invest in a talent pool that understands both legal interpretation and data analytics.
- Data Analytics and Tax Law (Alarie et al.): This research discusses how machine learning (ML) can predict legal outcomes. By training models on thousands of past court cases, researchers can predict how a tax authority or court might rule on a specific issue (e.g., "employee vs. contractor" status) with high accuracy, potentially reducing litigation and increasing certainty.
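A hedged sketch of the Alarie-style approach: train a classifier on features of past worker-classification rulings and measure how well it predicts outcomes on held-out cases. The features, the synthetic "rulings," and the random-forest choice below are illustrative assumptions, not the authors' actual models or data.

```python
# Hedged sketch: predict "employee vs. contractor" outcomes from case features.
# The features and data are synthetic placeholders, not real court decisions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(0, 2, n),      # worker sets own hours? (1 = yes)
    rng.integers(0, 2, n),      # worker supplies own tools?
    rng.uniform(0, 1, n),       # share of income from this one payer
])
# Synthetic "ruling": heavier dependence on a single payer pushes toward employee status.
y = (X[:, 2] + 0.3 * (1 - X[:, 0]) + rng.normal(0, 0.2, n) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Holdout accuracy on synthetic cases:", round(model.score(X_te, y_te), 3))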
Part 2: Understanding Artificial Intelligence (Source 3)
What is AI? (Darrell West, Brookings): West defines AI not just as "fast computing," but as systems with three specific qualities: Intentionality (designed to make decisions), Intelligence (often paired with machine learning to find trends), and Adaptability (learning from new data to improve over time). He emphasizes that AI is already here—transforming finance, healthcare, and security.
Part 3: Deep Dive into Neural Networks (Sources 5, 6, & 7)
- Neural Nets in Business (Ideamotive): This explains how Neural Networks (NNs) are used for complex business problems like fraud detection, stock market prediction, and customer churn analysis. NNs excel at finding non-linear patterns that traditional "straight-line" statistics (like basic regression) miss.
- A "No-Brainer" Introduction (Alteryx): This is a primer on how NNs mimic the human brain. They consist of "layers":
- Input Layer: The data you feed in.
- Hidden Layers: Where the "math" happens; the model assigns "weights" to different variables to see which ones are most important.
- Output Layer: The final prediction (e.g., "Yes, this is fraud" or "No, it isn't").
- Running Neural Nets in Alteryx: This is a technical guide for the Neural Network Tool in Alteryx. It explains that Alteryx uses a "feed-forward" model (data flows in one direction). Key takeaway: the user doesn't need to write code (R or Python) because the tool handles the complex math, but the user must carefully choose the "number of nodes" in the hidden layer to prevent the model from becoming too complex (overfitting) or too simple (underfitting).
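The sketch below shows the same feed-forward idea in code (scikit-learn is used here for illustration; the Alteryx tool exposes comparable settings without code). The synthetic data and the particular hidden-layer sizes are assumptions; the point is how that one knob moves holdout accuracy.

```python
# Minimal feed-forward neural network sketch on synthetic data. The hidden-layer
# size is the knob the text warns about: too large overfits, too small underfits.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

for nodes in (2, 10, 50):                      # try a few hidden-layer sizes
    nn = MLPClassifier(hidden_layer_sizes=(nodes,), max_iter=2_000, random_state=1)
    nn.fit(X_tr, y_tr)
    print(f"{nodes:>2} hidden nodes -> holdout accuracy {nn.score(X_te, y_te):.3f}")
```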
Key Takeaway for You:
If you combine these readings, the message is clear: Tax administration is moving toward a real-time, AI-driven model. To participate in this future, a professional needs to understand the legal framework (OECD/Alarie), the mathematical logic of AI (West/Neural Net Primers), and the practical tools (Alteryx) to execute it.
****
Professor Swenson's paper, "Using Machine Deep Learning AI to Improve Forecasting of Tax Payments for Corporations":
This paper, published in October 2024, addresses a practical financial problem: how can corporations accurately forecast and prepay their federal income taxes to avoid penalties? Here is a 2-minute summary:
The Problem: The "Costly Guess"
- Corporations must prepay at least 90% of their annual tax bill in quarterly installments.
- If they underpay: They face IRS penalties.
- If they overpay: They lose "opportunity cost" by letting the government hold onto cash that could have been invested or used for operations. Traditional forecasting (often based on simple linear predictions of annual income) is notoriously difficult because tax laws are complex and corporate income can be volatile.
The Methodology: AI vs. Traditional Stats
- Swenson used a massive dataset from Compustat covering all U.S. publicly traded corporations from 2000 to 2024. This period is particularly challenging because it includes the high volatility of the COVID-19 pandemic. He tested three primary "Deep Learning" and Machine Learning models:
- Neural Networks (Multilayer Perceptrons): Models that mimic brain connections to find non-linear relationships.
- Random Forests: A "forest" of decision trees that work together to make a prediction.
- The Tree Method (Gradient Boosting): Building trees sequentially to correct the errors of previous ones.
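A hedged sketch of that model comparison on synthetic data (the study itself used Compustat financials; the features, sample size, and hyperparameters below are placeholders):

```python
# Fit a neural network, a random forest, and gradient boosting to a synthetic
# tax-payment-like target and compare holdout error. Illustrative only.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=2_000, n_features=15, noise=25.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

models = {
    "Neural network (MLP)": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=3_000, random_state=7),
    ),
    "Random forest": RandomForestRegressor(n_estimators=300, random_state=7),
    "Gradient boosting": GradientBoostingRegressor(random_state=7),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name:22s} MAE: {mean_absolute_error(y_te, model.predict(X_te)):.1f}")
```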
The Key Findings
- Superior Accuracy: The AI models—specifically Neural Networks—significantly outperformed traditional forecasting methods.
- Resilience to Volatility: Remarkably, these models remained robust and accurate even during the extreme economic disruptions of the 2020–2022 pandemic years.
- Better Data Utilization: Deep learning was able to handle "noisy" financial data and complex interactions between variables that traditional regression models typically miss.
Why It Matters
- For Corporate Tax Teams: It provides a blueprint for using tools (like Alteryx or Python) to reduce penalty risks and manage cash flow better.
- For Stock Analysts: More accurate tax forecasts lead to better valuations of a company’s "Net Income," making the stock market more efficient.
- For Governments: It helps tax authorities predict revenue streams more reliably.
Bottom Line: Swenson shows that Deep Learning isn't just for tech companies; it has a direct, "dollars and cents" application for corporate tax departments looking to optimize their quarterly cash outflows.
Swenson paper on Corporate ETRs:
This paper, published in the Journal of Information Systems (2021), investigates how to improve the accuracy of textual analysis in accounting by focusing on context. Here is a summary of the core findings and methodology:
The Problem: The "Bag of Words" Limitation
- Most textual analysis in accounting uses a generic "bag of words" approach—counting words from standard finance and accounting dictionaries (like the Loughran and McDonald dictionary) across an entire document (like a 10-K filing).
- The authors argue that this "one-size-fits-all" method loses critical nuances when applied to specialized areas like corporate taxation.
The Solution: Building Context
- The researchers propose two major shifts to improve the prediction of Effective Tax Rates (ETRs):
- Tax-Specific Dictionaries: Instead of general finance terms, they created expert-derived dictionaries specifically for tax-related events (e.g., words like "penalty," "audit," or "tax court").
- Location Matters: Instead of scanning the entire 10-K, they focused their analysis specifically on tax-related sections (such as tax footnotes), where words carry the most relevant meaning.
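A minimal sketch of the dictionary-plus-location idea: count expert-chosen tax terms, but only within tax-related text such as the footnote. The word list and sample passage below are illustrative, not the authors' dictionaries.

```python
# Count tax-specific terms inside a tax footnote (rather than the whole 10-K).
# The dictionary and sample text are illustrative placeholders.
import re

TAX_NEGATIVE_TERMS = {"penalty", "audit", "deficiency", "tax court", "unrecognized"}

def term_counts(text, terms):
    text = text.lower()
    return {t: len(re.findall(r"\b" + re.escape(t) + r"\b", text)) for t in terms}

tax_footnote = """The Company is under audit in several jurisdictions and recorded
an unrecognized tax benefit; no penalty has been assessed to date."""

print(term_counts(tax_footnote, TAX_NEGATIVE_TERMS))
```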
Key Findings
- Superior Accuracy: Models using tax-specific, expert-derived dictionaries significantly outperformed those using general accounting and finance dictionaries.
- The Power of Narrowing Focus: Applying generic word counts only to tax-related content (rather than the whole document) also provided a statistically significant improvement in model fit.
- Directional Clarity: In generic analysis, positive and negative words often cancel each other out or give mixed signals. The researchers found that their tax-specific dictionaries were much more consistent with theoretical expectations (e.g., positive tax events actually correlated with predictable changes in ETR).
Why It Matters
- The paper demonstrates that context is king. For researchers and practitioners, it suggests that "expertise-driven" textual analysis is far more valuable than "automated generic" analysis. By narrowing the scope to specific sections of a disclosure and using targeted vocabulary, analysts can gain a much clearer picture of a firm's tax strategy and future tax liabilities.
- Two-Sentence Takeaway: Generic textual analysis is often too "noisy" to predict specific outcomes like tax rates. By using expert-curated tax dictionaries and focusing only on tax-related sections of financial reports, you can significantly improve the predictive power of accounting data.
***************
MODULE 7: REGRESSION
What Can Regression
Analysis Do for You?
• Make predictions (based on available information)
• Estimate group means (for similar individuals)
• Measure effects (while controlling for other influences)
• Help evaluate/improve a model (of a relationship)
1. Make a Prediction
For an individual, predict the value of the dependent variable, given the values of some of the explanatory variables.
• Process: “Regress” the dependent variable onto the given explanatory variables. Then “Predict.” Fill in the values of the explanatory variables. Hit the “Predict” button.
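A minimal sketch of the regress-then-predict workflow, using synthetic data in the spirit of the fleet-cost example (Mileage, Age, Make, Costs) used later in this module; the numbers are made up.

```python
# Regress Costs on Mileage, Age, and Make (synthetic fleet-style data), then predict
# the cost for one new vehicle. Data and coefficients are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "Mileage": rng.uniform(5_000, 60_000, n),
    "Age": rng.uniform(1, 10, n),
    "Make": rng.integers(0, 2, n),            # a 0/1 qualitative variable
})
df["Costs"] = (500 + 0.05 * df["Mileage"] + 120 * df["Age"] + 80 * df["Make"]
               + rng.normal(0, 300, n))

X = sm.add_constant(df[["Mileage", "Age", "Make"]])
model = sm.OLS(df["Costs"], X).fit()
print(model.summary())        # coefficients, standard errors, p-values, R-squared

# "Predict": fill in the explanatory values for one vehicle and hit predict.
new_vehicle = pd.DataFrame({"const": [1.0], "Mileage": [30_000], "Age": [4], "Make": [1]})
print(model.predict(new_vehicle))
```

The summary output also shows the coefficient standard errors and significance levels used in the “Measure a Pure Effect” and “Which Variables Belong” discussions below.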
3. Measure a Pure Effect
A one-unit difference in an explanatory variable, when everything else of relevance
remains the same, is typically associated with how large a difference in the
dependent variable?
• Process: “Regress” the dependent variable onto all of the relevant explanatory
variables (i.e., use the “most complete” model available).
• Answer: (coefficient of explanatory variable)
± (~2)·(standard error of coefficient)
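A worked illustration with hypothetical numbers: if the Mileage coefficient were estimated at 0.05 (dollars of Cost per mile) with a standard error of 0.004, then

$$0.05 \pm 2 \times 0.004 \;\Rightarrow\; [0.042,\ 0.058],$$

i.e., holding the other variables fixed, each additional mile is associated with roughly 4.2 to 5.8 cents of additional cost.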
4. The Explanatory Power of the Model
• Names can vary: The {adjusted, corrected, unbiased} {coefficient of
determination, r-squared} all refer to the same thing.
– Without an adjective, the {coefficient of determination, r-squared} refers to a
number slightly larger than the “correct” number, and is a throwback to pre-
computer days.
• When a new variable is added to a model, which actually contributes
nothing to the model (i.e., its true coefficient is 0), the adjusted
coefficient of determination will, on average, remain unchanged.
– Depending on chance, it might go up or down a bit.
– *If negative, interpret it as 0%.
– The thing without the adjective will always go up. That’s obviously not quite
“right.”
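For reference, the standard adjustment (with $n$ observations and $k$ explanatory variables) is

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1},$$

which is why adding a worthless variable leaves the adjusted figure roughly unchanged on average, while the unadjusted $R^2$ can only go up.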
Modelling: Relative Explanatory Importance
• Rank Mileage, Age, and Make in order of relative importance in helping to explain why Costs vary across the
fleet.
– One standard-deviation’s-worth of variation in Mileage is associated with 1.1531 standard-deviation’s-worth of variation in Costs.
– One standard-deviation’s-worth of variation in Age is associated with 0.5567 standard-deviation’s-worth of variation in Costs.
– One standard-deviation’s-worth of variation in Make is associated with 0.2193 standard-deviation’s-worth of variation in Costs.
You can’t compare regression coefficients directly, since they may carry
different dimensions.
• The beta-weights are dimensionless, and combine how much each explanatory
variable varies, with how much that variability leads to variability in the
dependent variable.
– Specifically, they are the product of each explanatory variable’s standard deviation
(how much it varies) and its coefficient (how much its variation affects the
dependent variable), divided by the standard deviation of the dependent variable
(just to remove all dimensionality).
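In symbols, the beta-weight for explanatory variable $j$ is

$$\text{beta-weight}_j = b_j \cdot \frac{s_{x_j}}{s_y},$$

where $b_j$ is the estimated coefficient, $s_{x_j}$ is the standard deviation of that explanatory variable, and $s_y$ is the standard deviation of the dependent variable.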
Modelling: Which Variables “Belong”
in the (Current) Model?
How strong is the evidence that each explanatory variable has a non-zero coefficient
(i.e., plays a predictive role) in the current model?
• Process: “Regress” the dependent variable onto the (current) set of explanatory
variables. For each explanatory variable, examine the “significance level”
(synonymously, the “p-value”) of the sample data with respect to the null
hypothesis that the true coefficient of that variable is zero.
– The closer the significance level is to 0%, the stronger is the evidence against that null hypothesis (i.e., the stronger is the evidence that this variable belongs in the model).
Which Variables “Belong” in the (Current) Model?
• How strong is the evidence that Mileage, Age, and Make each “belong”
in the model which predicts Costs from all three?
– For Mileage and Age, “overwhelmingly strong”.
– For Make, a “little bit of supporting evidence”, but not even “moderately strong”.
• With more data, if the true coefficient of Make is non-zero, the significance level will move towards
0%, and the evidence for inclusion will be stronger.
• With more data, if the true coefficient of Make is
really zero, the significance level will stay well above
0%, and the estimate of the coefficient will move
towards 0 (the “truth”).
Modelling: Detection of Modelling Issues
Is a linear model appropriate?
• Judgment (non-statistical)
• Residual analysis (non-statistical)
• Residual plots
– Against explanatory variables
– Against predicted values
• Outlier analysis
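A minimal sketch of one of these checks, a residual plot against predicted values, on synthetic single-variable data (the data and fit are illustrative only):

```python
# Plot residuals against predicted values; a curved or funnel-shaped pattern would
# suggest a linear model is missing something. Synthetic data for illustration.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 150)
y = 2.0 * x + rng.normal(0, 1.5, 150)          # roughly linear synthetic data

slope, intercept = np.polyfit(x, y, 1)          # simple one-variable fit
predicted = intercept + slope * x
residuals = y - predicted

plt.scatter(predicted, residuals, alpha=0.6)
plt.axhline(0, linewidth=1)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual plot: look for curvature, funnels, and outliers")
plt.show()
```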
Modelling: What is the Structure of the Relationship?
Is a linear model appropriate? Are there alternatives?
• Interactions
• Other nonlinearities
– Quadratic
– Logarithmic – post-course
• In explanatory variables
• In the dependent variable
• Qualitative variables
– Explanatory
– Dependent – post-course
Summary of Amy Gallo’s 2015 Harvard Business Review article, "A Refresher on Regression Analysis”:
This is a foundational guide for managers on how to use one of the most powerful tools in data science to make better business decisions.
Here is a 3-minute summary of the key concepts and takeaways:
1. What is Regression Analysis?
- Gallo defines regression as a mathematical way of sorting out which variables have an impact on a specific outcome. It allows you to move beyond "gut feelings" to determine:
- Which factors matter most? (e.g., Does rain or price impact sales more?)
- Which can we ignore? (e.g., Does a competitor's rumor actually affect us?)
- How do these factors interact?
- How certain are we?
2. The Core Mechanics: Variables and the "Line"
- To understand regression, you must identify two types of variables:
- The Dependent Variable: The main factor you are trying to understand or predict (e.g., monthly sales).
- The Independent Variables: The factors you suspect have an impact on your dependent variable (e.g., rainfall, ad spend, or price).
- The analysis produces a regression line—the "best fit" line through a scatter plot of data points. This line provides a formula that estimates how much the dependent variable will change for every unit of change in the independent variable.
3. Key Business Applications
- Gallo highlights three primary ways companies use this analysis:
- Explanation: Understanding why something happened (e.g., Why did customer service calls drop last month?).
- Prediction: Forecasting future outcomes (e.g., What will our revenue look like in the next quarter?).
- Decision Making: Choosing between different paths (e.g., Should we spend money on a new promotion or a product feature?).
4. Critical Warnings: The "Gotchas"
- The article emphasizes that even the best math can be misleading if misapplied. Gallo offers three major cautions:
- Correlation is NOT Causation: This is the most important rule. Just because two things move together (e.g., rain and sales) doesn't mean one caused the other. You must look for the "physical mechanism" in the real world to prove a causal link.
- Beware of "Garbage In, Garbage Out": Regression is extremely sensitive to bad data. If your data collection is "leaky" or biased, your results will be useless or even dangerous.
- The Danger of Overcomplicating: While you can include many variables, adding too many can make the model "overfit" the data, making it less accurate for future predictions.
5. Final Takeaway for Managers
You don't need to do the math yourself—software like Excel or Stata can handle the crunching. Your job as a manager is to ask the right questions, ensure the data is trustworthy, and interpret the results with a healthy dose of skepticism and real-world observation. Gallo's final advice: the goal isn't just to understand the data; it’s to understand what is happening in the world so you can act on it.