db2ixf
Helps the user parse the PC/IXF file format of IBM DB2.
An IXF file is organised as a sequence of records. These records have five main types: Header, Table, Column Descriptor, Data and Application.
Inside the IXF file, the records are ordered: the file starts with a header record, followed by a table record and a set of column descriptor records (each column descriptor is itself a record), and it ends with the set of data records.
IXF = H + T + Set(C) + Set(D).
Each record type is represented by a list of fields, and each field has a length in bytes that is used to read data from the IXF file.
For more information about record types, please visit this link.
The data records [Set(D)] store the data we want to extract, which means that for each column we need to extract its content from the data record. Each column has its own data type.
For more information about data types, please visit this link.
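The record layout itself is handled internally by the parser; purely as a hedged illustration of the H + T + Set(C) + Set(D) sequence (assuming the standard PC/IXF convention of a 6-byte ASCII record-length field counting the bytes that follow it, then a 1-byte record-type code), the sequence of record types could be inspected like this:

```python
# Hedged sketch, not part of db2ixf: assumes each PC/IXF record starts with a
# 6-byte ASCII length field (length of the remaining record) followed by a
# 1-byte record-type code ('H', 'T', 'C', 'D' or 'A').
def iter_record_types(path):
    with open(path, "rb") as f:
        while True:
            raw_length = f.read(6)
            if len(raw_length) < 6:
                break                       # end of file
            body = f.read(int(raw_length))  # rest of the record
            yield body[:1].decode("ascii")  # record-type code

# A well-formed file should yield: H, T, C, ..., C, D, ..., D
```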
Classes
IXFParser(file)
PC/IXF Parser.
Attributes:
Name | Type | Description |
---|---|---|
file | str, Path, PathLike or File-Like Object | Input file; it is better to use a file-like object. |
Init an instance of the PC/IXF Parser.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file | str, Path, PathLike or File-Like Object | Input file; it is better to use a file-like object. | required |
Source code in src/db2ixf/ixf.py
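As a minimal usage sketch (the import path follows the source location src/db2ixf/ixf.py; "table.ixf" is a hypothetical input file):

```python
from db2ixf.ixf import IXFParser

# Preferred: pass an opened binary file-like object.
with open("table.ixf", "rb") as f:
    parser = IXFParser(f)

# Also accepted: a path given as str, Path or PathLike.
parser = IXFParser("table.ixf")
```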
Attributes
header_record: OrderedDict = OrderedDict() (instance-attribute)
Contains header metadata extracted from the ixf file.
table_record: OrderedDict = OrderedDict() (instance-attribute)
Contains table metadata extracted from the ixf file.
column_records: List[OrderedDict] = [] (instance-attribute)
Contains the column descriptions extracted from the ixf file.
pyarrow_schema: Schema = schema([]) (instance-attribute)
Pyarrow schema extracted from the ixf file.
current_data_record: OrderedDict = OrderedDict() (instance-attribute)
Contains the current data record extracted from the ixf file.
end_data_records: bool = False (instance-attribute)
Flags the end of the data records in the ixf file.
current_row: OrderedDict = OrderedDict() (instance-attribute)
Contains parsed data extracted from a data record of the ixf file.
current_row_size: int = 0 (instance-attribute)
Current row size in bytes.
current_total_size: int = 0 (instance-attribute)
Current total size of the rows.
number_rows: int = 0 (instance-attribute)
Number of rows extracted from the ixf file.
number_corrupted_rows: int = -1 (instance-attribute)
Number of corrupted rows in the ixf file.
opt_batch_size: int = init_opt_batch_size(self.file_size) (instance-attribute)
Estimated optimal batch size.
Functions
__read_header(record_type=None)
Read the header record.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record_type | dict | Dictionary containing the names of the record fields and their length. | None |
Returns:
Type | Description |
---|---|
dict | Header record of the input file. |
Source code in src/db2ixf/ixf.py
__read_table(record_type=None)
Read the table record.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record_type | dict | Dictionary containing the names of the record fields and their length. | None |
Returns:
Type | Description |
---|---|
dict | Table record of the input file. |
Source code in src/db2ixf/ixf.py
__read_column_records(record_type=None)
Read the column records.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record_type | dict | Dictionary containing the names of the record fields and their length. | None |
Returns:
Type | Description |
---|---|
List[OrderedDict] | Column descriptor records of the input file. |
Raises:
Type | Description |
---|---|
NotValidColumnDescriptorException | If the IXF contains an invalid column descriptor. |
Source code in src/db2ixf/ixf.py
__read_data_record(record_type=None)
Read one data record.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record_type | dict | Dictionary containing the names of the record fields and their length. | None |
Returns:
Type | Description |
---|---|
OrderedDict | Dictionary containing current data record from IXF file. |
Source code in src/db2ixf/ixf.py
__parse_data_record()
Parses one data record.
It collects data from fields of the current data record.
Returns:
Name | Type | Description |
---|---|---|
OrderedDict | OrderedDict | Dictionary containing all extracted data from fields of the data record. |
Source code in src/db2ixf/ixf.py
__update_statistics()
Update stats and change the state of the parser.
Source code in src/db2ixf/ixf.py
__parse_all_data_records()
Parses all the data records.
Yields:
Type | Description |
---|---|
dict | Parsed row data from IXF file. |
Source code in src/db2ixf/ixf.py
__start_parsing()
Starts the parsing.
Source code in src/db2ixf/ixf.py
start_parsing()
Starts the parsing.
__check_parsing()
Do some checks on the parsing.
Source code in src/db2ixf/ixf.py
check_parsing()
Do some checks on the parsing.
Returns:
Type | Description |
---|---|
bool | True if parsing and/or conversion are ok. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
__iter_row()
Yields extracted rows (without parsing the header, table, and column records).
iter_row()
Yields parsed rows.
It won't work if you use it alone: you need to start the parsing with the start_parsing method, then you can iterate over rows using iter_row.
Most of the time, you do not need to use this method.
You will need it in case you want to customize the parsing, for example to add support for a new output format.
Source code in src/db2ixf/ixf.py
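A hedged sketch of that flow, useful only when building a custom output format ("table.ixf" is a hypothetical input file):

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as f:
    parser = IXFParser(f)
    parser.start_parsing()         # reads the header, table and column records
    for row in parser.iter_row():  # then yields parsed rows one by one
        print(row)
```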
__iter_batch_of_rows(data=None, batch_size=None)
Yields batch of parsed rows.
Source code in src/db2ixf/ixf.py
iter_batch_of_rows(data=None, batch_size=None)
Yields batches of parsed rows.
It won't work if you use it alone: you need to start the parsing with the start_parsing method, then you can iterate over batches of rows using the iter_batch_of_rows method.
Most of the time, you do not need to use this method.
You will need it in case you want to customize the parsing, for example to add support for a new output format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Iterable[Dict] | Data extracted from ixf file (parsed rows). | None |
batch_size | int | Batch size. | None |
Yields:
Type | Description |
---|---|
List[Dict] | Batch of parsed rows. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
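A hedged sketch, feeding the parser's own row iterator into the batching method ("table.ixf" and the batch size are placeholders):

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as f:
    parser = IXFParser(f)
    parser.start_parsing()
    rows = parser.iter_row()
    for batch in parser.iter_batch_of_rows(data=rows, batch_size=10_000):
        print(len(batch))  # each batch is a list of parsed rows
```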
__iter_pyarrow_record_batch(data=None, batch_size=None)
Yields pyarrow record batches from an iterable of rows.
Source code in src/db2ixf/ixf.py
iter_pyarrow_record_batch(data=None, batch_size=None)
Yields pyarrow record batches.
It won't work if you use it alone: you need to start the parsing with the start_parsing method and create the pyarrow schema using the get_or_create_pyarrow_schema method, then you can iterate over record batches using the iter_pyarrow_record_batch method.
Most of the time, you do not need to use this method.
You will need it in case you want to customize the parsing, for example to add support for a new output format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Iterable[Dict] | Data extracted from ixf file (parsed rows). | None |
batch_size | int | Batch size. | None |
Yields:
Type | Description |
---|---|
RecordBatch | Pyarrow record batch. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
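A hedged sketch of the start_parsing → schema → record batches chain described above:

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as f:
    parser = IXFParser(f)
    parser.start_parsing()
    parser.get_or_create_pyarrow_schema()  # builds the schema from the column records
    batches = parser.iter_pyarrow_record_batch(
        data=parser.iter_row(), batch_size=10_000
    )
    for record_batch in batches:
        print(record_batch.num_rows)
```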
__get_or_create_pyarrow_schema(pyarrow_schema=None, for_delta=False)
Get or create pyarrow schema based on the scope it will be used.
Source code in src/db2ixf/ixf.py
get_or_create_pyarrow_schema(pyarrow_schema=None, for_delta=False)
Get or create the pyarrow schema based on the scope of its usage.
It won't work if you use it alone: you need to start the parsing with the start_parsing method, then you can create the pyarrow schema using the get_or_create_pyarrow_schema method. After applying it, you may need to iterate over pyarrow record batches.
Most of the time, you do not need to use this method.
You will need it in case you want to customize the parsing, for example to add support for a new output format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pyarrow_schema | Schema | Pyarrow schema. | None |
for_delta | bool | If True, it adapts pyarrow schema for deltalake usage. | False |
Returns:
Type | Description |
---|---|
Schema | Pyarrow schema. |
Source code in src/db2ixf/ixf.py
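For illustration, the same call with for_delta=True when the record batches are meant for a deltalake writer (a sketch, not a prescribed workflow):

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as f:
    parser = IXFParser(f)
    parser.start_parsing()
    # Adapt the schema for deltalake usage before building record batches.
    schema = parser.get_or_create_pyarrow_schema(for_delta=True)
    print(schema)
```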
get_row()
Yields parsed rows.
Yields:
Type | Description |
---|---|
Dict | Generated parsed row. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
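A minimal sketch, assuming get_row is the self-contained entry point (unlike iter_row, it is not documented as requiring a prior start_parsing call):

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as f:
    for row in IXFParser(f).get_row():
        print(row)  # one parsed row (dict) at a time
```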
parse()
Yields parsed rows.
Alias for get_row, kept for compatibility with older versions.
Yields:
Type | Description |
---|---|
Dict | Generated parsed row. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
get_all_rows()
Get all the parsed rows from the ixf file.
Returns:
Type | Description |
---|---|
List[Dict] | List of all extracted rows. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Notes
- Attention: it loads all the extracted rows into memory.
Source code in src/db2ixf/ixf.py
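A short sketch; note the memory caveat above ("table.ixf" is a hypothetical file):

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as f:
    rows = IXFParser(f).get_all_rows()  # loads every extracted row into memory

print(len(rows))  # total number of extracted rows
```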
to_json(output)
Parses and converts to JSON format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, Path, PathLike, IO] | Output file. It is better to use a file-like object. | required |
Returns:
Type | Description |
---|---|
bool | True if the parsing and conversion are ok. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
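A hedged sketch; the output is given here as a path for brevity, although a file-like object is preferred per the table above:

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as source:
    ok = IXFParser(source).to_json("table.json")
print(ok)  # True if the parsing and conversion are ok
```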
to_jsonline(output)
Parses and converts to JSON Lines format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, Path, PathLike, IO] | Output file. It is better to use a file-like object. | required |
Returns:
Type | Description |
---|---|
bool | True if the parsing and conversion are ok. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
to_csv(output, sep='|', batch_size=None)
Parses and converts to CSV format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, Path, PathLike, TextIO] | Output file. It is better to use a file-like object. | required |
sep | str | Separator/delimiter of the columns. | '\|' |
batch_size | int | Batch size, used for memory optimization. | None |
Returns:
Type | Description |
---|---|
bool | True if the parsing and conversion are ok. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
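A hedged sketch with a custom separator and an explicit batch size (both values are arbitrary placeholders):

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as source:
    ok = IXFParser(source).to_csv("table.csv", sep=";", batch_size=50_000)
```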
get_pyarrow_record_batch(data=None, batch_size=None, for_delta=False)
Yields pyarrow record batches.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | Iterable[Dict] | Data extracted from ixf file (parsed rows). | None |
batch_size | int | Batch size. | None |
for_delta | bool | If True, it adapts pyarrow schema for deltalake usage. | False |
Yields:
Type | Description |
---|---|
RecordBatch | Pyarrow record batch. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
to_parquet(output, parquet_version='2.6', batch_size=None)
Parses and converts to PARQUET format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output | Union[str, Path, PathLike, BinaryIO] | Output file. It is better to use a file-like object. | required |
parquet_version | str | Parquet version. Please see the pyarrow documentation. | '2.6' |
batch_size | int | Number of rows to extract before writing to the parquet file. It is used for memory optimization. | None |
Returns:
Type | Description |
---|---|
bool | True if the parsing and conversion are ok. |
Raises:
Type | Description |
---|---|
IXFParsingError | In case it encounters a parsing error. |
Source code in src/db2ixf/ixf.py
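A hedged sketch; the parquet version echoes the documented default and the batch size is an arbitrary value:

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as source:
    ok = IXFParser(source).to_parquet(
        "table.parquet",
        parquet_version="2.6",
        batch_size=100_000,  # rows extracted per write, for memory optimization
    )
```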
to_deltalake(table_or_uri, partition_by=None, mode='error', overwrite_schema=False, schema_mode=None, partition_filters=None, large_dtypes=False, batch_size=None, **kwargs)
Parses and converts to a deltalake table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_or_uri | Union[str, pathlib.Path, DeltaTable] | URI of a table or a DeltaTable object. | required |
partition_by | Optional[Union[List[str], str]] | List of columns to partition the table by. Only required when creating a new table. | None |
mode | Literal['error', 'append', 'overwrite', 'ignore'] | How to handle existing data. Default is to error if the table already exists. If "append", will add new data. If "overwrite", will replace the table with new data. If "ignore", will not write anything if the table already exists. | 'error' |
overwrite_schema | bool | If True, allows updating the schema of the table. | False |
schema_mode | Optional[Literal['merge', 'overwrite']] | If set to "overwrite", allows replacing the schema of the table. Set to "merge" to merge with the existing schema. | None |
partition_filters | Optional[List[Tuple[str, str, Any]]] | Defaults to None. The partition filters that will be used for partition overwrite. Only used in the pyarrow engine. | None |
large_dtypes | bool | If True, the table schema is checked against large_dtypes. | False |
batch_size | int | Number of rows to extract before the conversion operation. It is used for memory optimization. | None |
**kwargs | Optional[dict] | Some of the arguments you can give to this function. | {} |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the parsing and conversion are ok. |
Source code in src/db2ixf/ixf.py
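A hedged sketch; the table URI and partition column below are hypothetical placeholders:

```python
from db2ixf.ixf import IXFParser

with open("table.ixf", "rb") as source:
    ok = IXFParser(source).to_deltalake(
        "/data/delta/my_table",    # hypothetical table URI
        partition_by=["COUNTRY"],  # hypothetical partition column
        mode="append",             # add to the table if it already exists
        batch_size=100_000,
    )
```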