Skip to main content

Data Dump Process

Description:

Data dump is a process to extract raw data from a database and write it down into a csv file to perform further analysis. It runs daily at morning 6.15 AM. Below are the detail breakdown of working of dump process;

  1. Retrieve Active Dump Configurations

    Executes a SQL query to fetch active dump configurations from the dmp_ table. These include procedure name, table name, file name prefix, page size, column projections, query string, and dump ID.

  2. Count Total Rows to Process

    Executes a COUNT(*) query to determine the total number of rows in the target table or a custom query. For specific cases like dmpId == 40 (mCCTs dumps), where the query lacks the DISTINCT keyword, the implementation doesn’t replace DISTINCT with COUNT for pagination purposes. In such cases, the count is fixed at 6,000,000.

  3. Calculate Pagination

    Computes the number of pages based on the total rows and page size.

  4. Fetch Column Names

    Retrieves column names for the target table or uses the projection string from the configuration.

  5. Paginate and Generate CSV Files

    Iterates through each page and performs the following steps:

  • Define File Name: Constructs a unique file name using the page range and current date.
  • Create ZIP File: Initializes a ZIP file and prepares it to hold the CSV file.
  • Write Column Headers: Writes the column headers into the CSV.
  • Fetch Data Rows: Executes a paginated SQL query using LIMIT to fetch rows for the current page.
  • Write Data Rows to CSV: Write each data row into the CSV file. Clears the session periodically to manage memory.
  • Handle Scrollable Query Errors: Logs any errors encountered during data retrieval and sends an email notification.
  1. Close ZIP File

    Completes the ZIP entry, flushes, and closes the ZIP file.

  2. Store File Information in Database

    Creates a DownloadableReport record with details like file name, path, type, and size. Saves the record in the database within a transaction.

Schedule:

Job NameSchedule
DataDumperJob.javaEvery day at 06:15 AM

Schema:

Below are the key tables used by DataDumperJob.java:

  • dmp_
  • downloadablereport

Dependencies:

None.

Process Flow:

Data Dump Process