Notes on MySQL Usage
From NSMWiki
This page is a temporary repository for some notes on how Sguil uses MySQL. Specifically, the notes are probably going to be related to performance of the SANCP table.
Database Schema
Here's a fragment of Tcl code from Sguild that describes the schema of the main SANCP merge table:
CREATE TABLE sancp \ ( \ sid INT UNSIGNED NOT NULL, \ sancpid BIGINT UNSIGNED NOT NULL, \ start_time DATETIME NOT NULL, \ end_time DATETIME NOT NULL, \ duration INT UNSIGNED NOT NULL, \ ip_proto TINYINT UNSIGNED NOT NULL, \ src_ip INT UNSIGNED, \ src_port SMALLINT UNSIGNED, \ dst_ip INT UNSIGNED, \ dst_port SMALLINT UNSIGNED, \ src_pkts INT UNSIGNED NOT NULL, \ src_bytes INT UNSIGNED NOT NULL, \ dst_pkts INT UNSIGNED NOT NULL, \ dst_bytes INT UNSIGNED NOT NULL, \ src_flags TINYINT UNSIGNED NOT NULL, \ dst_flags TINYINT UNSIGNED NOT NULL, \ INDEX p_key (sid,sancpid), \ INDEX src_ip (src_ip), \ INDEX dst_ip (dst_ip), \ INDEX dst_port (dst_port), \ INDEX src_port (src_port), \ INDEX start_time (start_time) \ ) TYPE=MERGE UNION=([join $tmpTables ,]) \
The merge table is made up of identically defined sancp tables, one for each day and each sensor. For example, for sensors named "internet", "business" and "education", you'd have tables:
sancp_internet_20080101 sancp_business_20080101 sancp_business_20080101 sancp_internet_20080102 sancp_business_20080102 sancp_business_20080102
Note that sguild takes care of automatically creating the individual tables whenever the first new data arrives for each sensor at the beginning of the day. However, these tables are never searched directly. Sguild only search the main "sancp" MERGE table.
Misc Notes
- it's not uncommon to have 500 million rows, though, or more. Some have more like 2 billion.
- MERGE really helps with data management. The data isn't stagnant. Rows are constantly being inserted, sometimes by the tens of thousand per 5/mins.
- Most of the searches involve a combination of time based searches(start_time) or IP (src_ip, dst_ip), though there are some port basedqueries from time-to-time
- At some point old data has to be purged. Being able to quickly drop the oldest data is a requirement.
- So, our main issue is with speeds of the queries and INSERTs(which is really a LOAD).
- the JOIN we do for queries might be an issue. The sancp table only has a sensor ID (sid). But all queryresults require a sensor name. So we JOIN the sensor table to get thatinfo. Perhaps it'd be better to simply insert the sensor name as well?
- I'm pretty sure the load problem is really that it takes a long time to update the indices, and it's being done on each table about once per minute or so. I would say every search has a timestamp requirement. Most also have the either the src or dst ip. Port searches are far less common, but not unheard of.
- indices (.MYI files) are about 2x the size of the data (.MYD files)

