Query JSON, HiveQL, BigQuery, and Python/R Analytics
Learn NoSQL and Big Data SQL concepts in this comprehensive guide. Master SQL-like querying in NoSQL databases, HiveQL, BigQuery SQL, and integration with Python/R for analytics. Practice querying JSON/document data and aggregating large datasets efficiently.
1. Introduction
With Big Data and NoSQL databases, traditional SQL queries need adaptation.
- NoSQL databases like MongoDB, Couchbase, and Cassandra store JSON, key-value, or document data.
- Big Data frameworks (Hive, BigQuery, Spark SQL) allow SQL-like queries on massive datasets.
- Integration with Python or R enables advanced analytics and visualization.
Key Points:
- SQL-like syntax simplifies querying non-relational data.
- Big Data SQL tools can process millions to billions of records efficiently.
- Python/R integration allows data science workflows on top of SQL queries.
2. SQL-like Querying in NoSQL Databases
2.1 Query JSON Documents (MongoDB Example)
-- Find employees with salary > 50000
db.Employees.find({ "Salary": { $gt: 50000 } })
-- Project only Name and Department
db.Employees.find({ "Salary": { $gt: 50000 } }, { Name: 1, Department: 1, _id: 0 })
-- Aggregation: Total salary per department
db.Employees.aggregate([
{ $group: { _id: "$Department", TotalSalary: { $sum: "$Salary" } } }
])
3. HiveQL / BigQuery SQL
Hive and BigQuery allow SQL-like queries on distributed data stored in Hadoop or cloud storage.
3.1 HiveQL Example
-- Count employees per department
SELECT Department, COUNT(*) AS TotalEmployees
FROM employees
GROUP BY Department;
-- Filter high salaries
SELECT Name, Salary
FROM employees
WHERE Salary > 50000;
3.2 BigQuery Example
-- Aggregate sales by region
SELECT Region, SUM(Sales) AS TotalSales
FROM `project.dataset.Sales`
GROUP BY Region
ORDER BY TotalSales DESC;
4. Integration with Python / R for Analytics
4.1 Python with Pandas & SQL
import pandas as pd
import sqlite3
# Connect to database
conn = sqlite3.connect('Sales.db')
# Query large dataset
df = pd.read_sql_query("SELECT Region, SUM(Sales) AS TotalSales FROM Sales GROUP BY Region", conn)
# Perform analytics
df['RunningTotal'] = df['TotalSales'].cumsum()
print(df)
4.2 R with DBI and dplyr
library(DBI)
library(dplyr)
con <- dbConnect(RSQLite::SQLite(), "Sales.db")
sales_summary <- tbl(con, "Sales") %>%
group_by(Region) %>%
summarise(TotalSales = sum(Sales)) %>%
arrange(desc(TotalSales))
print(sales_summary)
5. Practical Exercises
- Query JSON/document data in MongoDB for employees earning > 50,000.
- Aggregate department salaries using MongoDB aggregation pipeline.
- Use HiveQL or BigQuery to count records and compute sums for large datasets.
- Connect Python to a SQL or NoSQL database and perform grouped analytics.
- Compute running totals, moving averages, or rankings on large datasets in Python/R.
6. Tips for Beginners
- Start with small datasets before scaling to Big Data.
- Use aggregation pipelines in NoSQL for analytics queries.
- Leverage SQL-like syntax in Hive/BigQuery for easier transition from traditional SQL.
- Python/R integration is essential for visualization and advanced analytics.
- Always optimize queries to handle large datasets efficiently.
Next Step: After completing this optional module, you will have a complete end-to-end roadmap from beginner to advanced SQL, including relational SQL, analytics, optimization, security, backup, and NoSQL/Big Data concepts.