Embarking on a Data Engineering Journey
As someone who has long navigated the world of R programming, the increasing allure of data engineering has piqued my interest. While I’m familiar with tackling data challenges in R, I find that many emerging tools cater to these complexities more effectively than my traditional methods allow for. So, I decided to take a closer look at the available options and how they contrast with my typical R-based approaches.
Currently, I’m on the lookout for new opportunities in data and coding. If your organization aligns with my skill set, I'd love to connect; feel free to reach out at [email protected].
One of my core beliefs is that hands-on experience is the best teacher. Sure, I could have leveraged AI tools like Claude Code to automate my entire project setup, but I wanted to immerse myself in the process. By encountering and resolving issues firsthand, I'll gain insight into where the real complexities lie and adjust my prior assumptions accordingly. With Claude's assistance—just in a chat form to provide guidance—I navigated through various stages, allowing it to tidy up my SQL code where necessary. The aim here wasn't to perfect my SQL but to grasp the overall workflow, which will undoubtedly inform my future work.
In contemplating a practical project for this exploration, I chose to focus on my personal finance management. Previously, I relied on QuickBooks for its ability to connect with my bank and categorize both personal and freelance business expenses. Now, I’m looking to create a more manual, tailored approach—what I’m calling my “slowbooks” workflow, as my bank doesn’t support an API for direct access.
The two tools I’ll be evaluating build on the concept of a `Makefile`, which structures commands based on their interdependencies, ensuring that only the steps affected by a change are executed again. Although one could theoretically rely exclusively on basic `Makefile`s (or the newer `just` command runner), I’m interested in how dbt and {targets} can offer more structured solutions to my needs.
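The core `Makefile` idea, rebuilding an output only when it is missing or older than one of its inputs, can be sketched in a few lines of base R. This is only an illustration of the timestamp rule; `needs_rebuild()` and the temporary files are hypothetical names, not part of either tool compared here.

```r
# Hypothetical helper: TRUE when `target` is missing or older than any dependency.
needs_rebuild <- function(target, deps) {
  stopifnot(all(file.exists(deps)))
  if (!file.exists(target)) {
    return(TRUE)
  }
  any(file.mtime(deps) > file.mtime(target))
}

# Toy files standing in for a raw input and a derived output.
raw <- tempfile(fileext = ".csv")
out <- tempfile(fileext = ".rds")
writeLines("date,amount_aud", raw)

first <- needs_rebuild(out, raw)        # output does not exist yet

file.create(out)                        # pretend the step has now been run
Sys.setFileTime(raw, Sys.time() - 60)   # input is older than the output
second <- needs_rebuild(out, raw)

Sys.setFileTime(raw, Sys.time() + 60)   # input changed after the output was built
third <- needs_rebuild(out, raw)

print(c(first = first, second = second, third = third))
```

Only the first and third checks ask for a rebuild, which is the "only rerun what changed" behaviour described above.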
For those eager to skip to specific sections, here's a handy roadmap outlining the article's structure: [dbt](https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#dbt), [{targets}](https://jcarroll.com.au/2026/05/04/comparing-r-s-targets-and-dbt-for-data-engineering/#targets), and an in-depth comparison of workflows among others.
What I find particularly compelling about this endeavor is not merely the tools at play, but the opportunity to deepen my understanding of data engineering processes and practices. If you're working in this space, recognizing the nuanced differences between methodologies can greatly enhance how data-driven tasks are approached in your own projects.
In this section, we see the transition from a basic staging process to a more refined categorization phase. Although at first glance, the difference between the intermediate and final stages may seem minimal, it’s essential to highlight how this phase integrates data with merchant categories. The transactions gathered during the staging process are not simply passed through; they undergo a meticulous categorization by matching against patterns specified in the seed file.
An important analytical step is the categorization of transaction items based on the merchant file. Here’s how it works:
```sql
with transactions as (
    select * from {{ ref('stg_transactions') }}
),
merchants as (
    select * from {{ ref('seed_merchants') }}
),
matched as (
    ...
)
select ...
```
This SQL snippet joins transactions to merchants with a case-insensitive `ILIKE` match on the transaction description. When several merchant patterns match the same transaction, only the most specific match is kept, so no transaction is assigned more than one category.
Once categorization is in place, the next step is to create monthly aggregation tables. The `date_trunc()` function buckets each transaction into its calendar month, and `sum()` and `count(*)` then compute the total spend and the number of transactions per month:
```sql
select
    date_trunc('month', date)::date as month,
    sum(amount_aud) as total_spend_aud,
    count(*) as transaction_count
from {{ ref('int_transactions_categorised') }}
group by 1
```
This approach not only provides aggregates but also lays a foundation for further analysis—like examining trends in spending behavior.
This process can also be implemented in R, where the `{fuzzyjoin}` package handles the pattern matching between transaction descriptions and merchant patterns. The methodology translates well, making it viable for analysts who prefer R over SQL for data transformation:
```r
categorise_transactions <- function(transactions, merchants) {
...
}
```
With R, the categorization and monthly balance summarization can be orchestrated with ease, and the data pipeline moves cleanly from one step to the next. Each step is a function, and the dependencies between those functions determine the order in which they run, which keeps the workflow neat and efficient.
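To make the two steps concrete, here is a minimal base-R sketch of "keep the most specific match" followed by a monthly summary. It is not the `{fuzzyjoin}` implementation elided above. The function names, the `description`, `pattern` and `category` columns, and the sample rows are all made up for illustration; only `date`, `amount_aud` and the output column names come from the SQL. Two simplifying assumptions: patterns are plain substrings rather than `ILIKE` wildcards, and "most specific" is taken to mean "longest pattern".

```r
# Hypothetical sketch: case-insensitive substring match, longest pattern wins.
categorise_sketch <- function(transactions, merchants) {
  pick_category <- function(description) {
    hit <- vapply(
      merchants$pattern,
      function(p) grepl(tolower(p), tolower(description), fixed = TRUE),
      logical(1),
      USE.NAMES = FALSE
    )
    if (!any(hit)) {
      return("uncategorised")
    }
    matched <- merchants[hit, ]
    matched$category[which.max(nchar(matched$pattern))]
  }
  transactions$category <- vapply(
    transactions$description, pick_category, character(1), USE.NAMES = FALSE
  )
  transactions
}

# Hypothetical sketch: truncate dates to the first of the month, then aggregate.
monthly_summary_sketch <- function(transactions) {
  transactions$month <- as.Date(format(as.Date(transactions$date), "%Y-%m-01"))
  totals <- aggregate(amount_aud ~ month, data = transactions, FUN = sum)
  counts <- aggregate(amount_aud ~ month, data = transactions, FUN = length)
  data.frame(
    month = totals$month,
    total_spend_aud = totals$amount_aud,
    transaction_count = counts$amount_aud
  )
}

# Made-up sample data, for illustration only.
transactions <- data.frame(
  date = c("2026-01-03", "2026-01-17", "2026-02-02"),
  description = c("WOOLWORTHS METRO 123", "Woolworths Petrol", "UNKNOWN SHOP"),
  amount_aud = c(42.5, 60, 10),
  stringsAsFactors = FALSE
)
merchants <- data.frame(
  pattern = c("woolworths", "woolworths petrol"),
  category = c("groceries", "fuel"),
  stringsAsFactors = FALSE
)

categorised <- categorise_sketch(transactions, merchants)
monthly <- monthly_summary_sketch(categorised)
print(categorised)
print(monthly)
```

The second transaction matches both patterns, and the longer one decides its category. The third matches nothing and is labelled `uncategorised` instead of being dropped, so it still counts towards the monthly totals.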
Incorporating these techniques hinges on understanding the workflow intricacies. If you're in data engineering, recognizing the dual capabilities of SQL for processing and R for analytical depth reveals pathways for optimized data handling. Each method complements the other, underscoring the flexibility in modern data ecosystems as they accommodate varying preferences and technical proficiencies.
Final Thoughts on Workflow Efficiency
Reviewing the outcomes of the workflow reveals a lot about efficiency and data management. The reports show that out of the 13 tasks run, only one incurred a warning, which indicates a well-structured process overall. The warning in the test for transaction categorization — while it’s not ideal to have any warnings — is more of a minor headache than a crisis. It points to a handful of uncategorized records that, while they could complicate future analyses, don’t derail the functionality of the system.
The real takeaway, though, is in the execution. Completing the workflow in a mere four seconds is impressive. This rapid turnaround underscores the effectiveness of streamlining these models into a cohesive whole. Being able to visualize the results immediately in a database, such as with DuckDB, highlights the power of having a well-organized data pipeline. You can see tangible outcomes from your coding efforts almost instantly.
With that said, the option to integrate more write calls to the database in other parts of the workflow looms large. While it’s tempting to push for further data persistence at every stage, it's vital to weigh the performance implications. This calls for your attention if you're working in the data engineering space: you have to balance efficiency with thoroughness.
This brings us to a pressing point of consideration: how you manage changes across the Directed Acyclic Graph (DAG) of tasks. The clear differentiation between the incremental model of dbt and the {targets} methodology cannot be overstated. You'll find that {targets}' approach, which relies on data hashes to determine which parts of the workflow need rerunning, promises significant operational efficiency.
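The hash-and-skip idea can be shown with a toy cache in base R. This is not how {targets} is implemented internally; `hash_object()`, `run_step()` and the in-memory `cache` are hypothetical names, and the sketch assumes that hashing the serialized input is enough to detect a change.

```r
# Toy in-memory cache keyed by step name (illustration only).
cache <- new.env()

# Hash an R object by serializing it to a temporary file and taking its MD5.
hash_object <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(x, NULL), path)
  unname(tools::md5sum(path))
}

# Run `fn(input)` only if the input's hash differs from the one stored last time.
run_step <- function(name, input, fn) {
  key <- hash_object(input)
  entry <- cache[[name]]
  if (!is.null(entry) && identical(entry$key, key)) {
    message("skip: ", name)
    return(entry$value)
  }
  message("run:  ", name)
  value <- fn(input)
  cache[[name]] <- list(key = key, value = value)
  value
}

# Count how often the (pretend) expensive step really executes.
calls <- 0
total <- function(x) {
  calls <<- calls + 1
  sum(x)
}

r1 <- run_step("total", c(1, 2, 3), total)  # runs: nothing cached yet
r2 <- run_step("total", c(1, 2, 3), total)  # skipped: same input hash
r3 <- run_step("total", c(1, 2, 4), total)  # runs again: input changed
```

Because the decision depends on the content of the input and not on a file timestamp, rewriting a file with identical data would not trigger a rerun in this sketch.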
For those engaged in similar data handling processes or considering new solutions, these details aren't just academic; they have real-world consequences. Understanding the nuances can save time and resources, making your workflows not only faster but also smarter. Whether it's dbt with its persistent modeling strategy or {targets} with its adaptive nature, the future of data manipulation looks promising yet nuanced. The direction you choose may depend on your specific needs, but the lessons from these workflows offer invaluable insight into optimizing for both speed and reliability.