Query JSON, HiveQL, BigQuery, and Python/R Analytics

Learn NoSQL and Big Data SQL concepts in this comprehensive guide. Master SQL-like querying in NoSQL databases, HiveQL, BigQuery SQL, and integration with Python/R for analytics. Practice querying JSON/document data and aggregating large datasets efficiently.

1. Introduction

With Big Data and NoSQL databases, traditional SQL queries need adaptation.

NoSQL databases like MongoDB, Couchbase, and Cassandra store JSON, key-value, or document data.
Big Data frameworks (Hive, BigQuery, Spark SQL) allow SQL-like queries on massive datasets.
Integration with Python or R enables advanced analytics and visualization.

Key Points:

SQL-like syntax simplifies querying non-relational data.
Big Data SQL tools can process millions to billions of records efficiently.
Python/R integration allows data science workflows on top of SQL queries.

2. SQL-like Querying in NoSQL Databases

2.1 Query JSON Documents (MongoDB Example)

-- Find employees with salary > 50000

db.Employees.find({ "Salary": { $gt: 50000 } })

-- Project only Name and Department

db.Employees.find({ "Salary": { $gt: 50000 } }, { Name: 1, Department: 1, _id: 0 })

-- Aggregation: Total salary per department

db.Employees.aggregate([

{ $group: { _id: "$Department", TotalSalary: { $sum: "$Salary" } } }

])

3. HiveQL / BigQuery SQL

Hive and BigQuery allow SQL-like queries on distributed data stored in Hadoop or cloud storage.

3.1 HiveQL Example

-- Count employees per department

SELECT Department, COUNT(*) AS TotalEmployees

FROM employees

GROUP BY Department;

-- Filter high salaries

SELECT Name, Salary

FROM employees

WHERE Salary > 50000;

3.2 BigQuery Example

-- Aggregate sales by region

SELECT Region, SUM(Sales) AS TotalSales

FROM `project.dataset.Sales`

GROUP BY Region

ORDER BY TotalSales DESC;

4. Integration with Python / R for Analytics

4.1 Python with Pandas & SQL

import pandas as pd

import sqlite3

# Connect to database

conn = sqlite3.connect('Sales.db')

# Query large dataset

df = pd.read_sql_query("SELECT Region, SUM(Sales) AS TotalSales FROM Sales GROUP BY Region", conn)

# Perform analytics

df['RunningTotal'] = df['TotalSales'].cumsum()

print(df)

4.2 R with DBI and dplyr

library(DBI)

library(dplyr)

con <- dbConnect(RSQLite::SQLite(), "Sales.db")

sales_summary <- tbl(con, "Sales") %>%

group_by(Region) %>%

summarise(TotalSales = sum(Sales)) %>%

arrange(desc(TotalSales))

print(sales_summary)

5. Practical Exercises

Query JSON/document data in MongoDB for employees earning > 50,000.
Aggregate department salaries using MongoDB aggregation pipeline.
Use HiveQL or BigQuery to count records and compute sums for large datasets.
Connect Python to a SQL or NoSQL database and perform grouped analytics.
Compute running totals, moving averages, or rankings on large datasets in Python/R.

6. Tips for Beginners

Start with small datasets before scaling to Big Data.
Use aggregation pipelines in NoSQL for analytics queries.
Leverage SQL-like syntax in Hive/BigQuery for easier transition from traditional SQL.
Python/R integration is essential for visualization and advanced analytics.
Always optimize queries to handle large datasets efficiently.

Next Step: After completing this optional module, you will have a complete end-to-end roadmap from beginner to advanced SQL, including relational SQL, analytics, optimization, security, backup, and NoSQL/Big Data concepts.