Package overview

Note

In the following code samples, the h5features package is imported as:

import h5features as h5f

Brief

The h5features package allows you to easily interface your code with an HDF5 file. It is designed to efficiently read and write large feature datasets. It is a wrapper around h5py and is used, for example, in the ABXpy package.

  • Package organization:

    The main classes composing the package are h5f.Writer and h5f.Reader, which respectively write to and read from HDF5 files, and h5f.Data, which interfaces that data with your code.

  • Data structure:

    The h5features data is structured as follows:

    • a list of items represented by their names (file names for example),
    • for each item, some attached features as a numpy array,
    • some label information attached to the features, also as numpy arrays.
  • File structure:

    In an h5features file, data is stored as an HDF5 group. The underlying group structure directly follows the data organization. An h5features group mainly stores a version attribute and the following datasets: items, labels, features and index.

Description

The h5features package provides efficient and flexible I/O on a (potentially large) collection of (potentially small) 2D datasets with one fixed dimension (the ‘feature’ dimension, identical for all datasets) and one variable dimension (the ‘label’ dimension, possibly different for each dataset). For example, the collection of datasets can correspond to speech features (e.g. MFC coefficients) extracted from a collection of speech recordings with variable durations. In this case, the ‘label’ dimension corresponds to time and the meaning of the ‘feature’ dimension depends on the type of speech features used.
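The shape constraint described above can be sketched in a few lines. This is only an illustration, not part of the h5features API: check_collection is a hypothetical helper, and plain Python lists stand in for the numpy arrays the package actually uses.

```python
def check_collection(features):
    """Return the feature dimension shared by a collection of 2D datasets.

    `features` maps an item name to a list of feature vectors, one vector
    per label (e.g. one per time frame). Hypothetical helper: plain lists
    stand in for numpy arrays here.
    """
    dims = {len(vector) for frames in features.values() for vector in frames}
    if len(dims) != 1:
        raise ValueError(f"inconsistent feature dimensions: {sorted(dims)}")
    return dims.pop()


# Two utterances with different durations (3 and 5 frames) but the same
# number of coefficients per frame (13).
collection = {
    "utt_1": [[0.0] * 13 for _ in range(3)],
    "utt_2": [[0.0] * 13 for _ in range(5)],
}

feature_dim = check_collection(collection)  # fixed 'feature' dimension
frames_per_item = {name: len(frames) for name, frames in collection.items()}
```

Here the 'label' dimension (the number of frames) differs between the two items, while the 'feature' dimension is the same for both.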

The h5features package can handle small or large collections of small or large datasets, but the case that motivated its design is that of large collections of small datasets. This is a common case in speech signal processing, for example, where features are often extracted separately for each sentence in multi-hour recordings of speech signal. If the features are stored in individual files, the number of files becomes problematic. If the features are stored in a single big file which does not support partial I/O, the size of the file becomes problematic. To solve this problem, h5features is built on top of h5py, a Python binding of the HDF5 library, which supports partial I/O. All the items in the collection of datasets are stored in a single file and an indexing structure allows for efficient I/O on single items or on contiguous groups of items. h5features also indexes the ‘label’ dimension of each individual dataset and allows partial I/O along it. To continue our speech features example, this means that it is possible to load just the features for a specific time interval in a specific utterance (corresponding to a word or phone of interest, for instance). The labels indexing the ‘label’ dimension typically correspond to center times or time intervals associated with each feature vector in a dataset.
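The indexing idea can be illustrated with a small in-memory sketch. It is not the h5features implementation or its on-disk layout: ToyStore and its methods are hypothetical names, plain lists stand in for HDF5 datasets, and labels are assumed to be sorted 1D times within each item.

```python
from bisect import bisect_left, bisect_right


class ToyStore:
    """Hypothetical in-memory stand-in for a single indexed file."""

    def __init__(self):
        self.items = []     # item names, in insertion order
        self.index = []     # index[i] is the end offset (exclusive) of item i
        self.labels = []    # labels of all items, concatenated
        self.features = []  # feature vectors of all items, concatenated

    def add(self, name, labels, features):
        if len(labels) != len(features):
            raise ValueError("one label is required per feature vector")
        self.items.append(name)
        self.labels.extend(labels)
        self.features.extend(features)
        self.index.append(len(self.features))

    def _bounds(self, name):
        # Offsets [start, stop) of an item in the concatenated arrays.
        i = self.items.index(name)
        start = self.index[i - 1] if i > 0 else 0
        return start, self.index[i]

    def read_items(self, first, last=None):
        """Read one item, or the contiguous range first..last (inclusive)."""
        start, _ = self._bounds(first)
        _, stop = self._bounds(first if last is None else last)
        return self.labels[start:stop], self.features[start:stop]

    def read_interval(self, name, from_time, to_time):
        """Read the part of one item whose labels lie in [from_time, to_time]."""
        start, stop = self._bounds(name)
        lo = bisect_left(self.labels, from_time, start, stop)
        hi = bisect_right(self.labels, to_time, start, stop)
        return self.labels[lo:hi], self.features[lo:hi]


store = ToyStore()
store.add("utt_1", [0.0, 0.1, 0.2], [[1], [2], [3]])
store.add("utt_2", [0.0, 0.1, 0.2, 0.3], [[4], [5], [6], [7]])
store.add("utt_3", [0.0, 0.1], [[8], [9]])

one_item = store.read_items("utt_2")
two_items = store.read_items("utt_1", "utt_2")
sub_interval = store.read_interval("utt_2", 0.1, 0.2)
```

Because every read reduces to a pair of offsets, only the requested slice has to be loaded. With a backend that supports partial I/O, such as HDF5, the same slicing avoids reading the rest of the file.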

Command line converter

The script convert2h5features converts a set of files to a single h5features file. Supported file formats are numpy NPZ and Octave/Matlab mat files.

For more information on that script, run:

$ convert2h5features --help

Basic usage

import h5features as h5f

########################
# Prelude to the example
########################

def generate_data(nitem, nfeat=2, dim=10, labeldim=1, base='item'):
    """Return a randomly generated h5f.Data instance.

    - nitem is the number of items to generate.
    - nfeat is the number of feature vectors to generate for each item.
    - dim is the dimension of the feature vectors.
    - labeldim is the dimension of the label vectors.
    - base is the basename of the items.
    """
    import numpy as np

    # A list of item names
    items = [base + '_' + str(i) for i in range(nitem)]

    # A list of features arrays
    features = [np.random.randn(nfeat, dim) for _ in range(nitem)]

    # A list of 1D or 2D time arrays
    if labeldim == 1:
        labels = [np.linspace(0, 1, nfeat)] * nitem
    else:
        t = np.linspace(0, 1, nfeat)
        labels = [np.array([t+i for i in range(labeldim)])] * nitem

    # Format data as required by the writer
    return h5f.Data(items, labels, features, check=True)

########################
# Writing data to a file
########################

# Generate some data for 100 items
data = generate_data(100)

# Initialize a writer, write the data in a group called 'group1' and
# close the file
writer = h5f.Writer('exemple.h5')
writer.write(data, 'group1')
writer.close()

# More Pythonic: use the with statement
with h5f.Writer('exemple.h5') as writer:
    # Write the same data to a second group
    writer.write(data, 'group2')

    # You can append new data to an existing group if all items have
    # different names. Here we generate 10 more items and append them
    # to group2, which now stores 110 items.
    data2 = generate_data(10, base='item2')
    writer.write(data2, 'group2', append=True)

    # If append is not True, existing data in the group is overwritten.
    data3 = generate_data(10, base='item3')
    writer.write(data3, 'group2', append=True)  # 120 items
    writer.write(data3, 'group2')               # 10 items


##########################
# Reading data from a file
##########################

# Initialize a reader and load the entire group. A notable difference
# with the Writer is that a Reader is attached to a specific group of
# a file. This allows optimized read operations.
rdata = h5f.Reader('exemple.h5', 'group1').read()

# Hopefully we read the same data we just wrote
assert rdata == data

# Some more advanced reading facilities
with h5f.Reader('exemple.h5', 'group1') as reader:
    # Same as before, read the whole data
    whole_data = reader.read()

    # Read the first item stored in the group.
    first_item = reader.items.data[0]
    rdata = reader.read(first_item)
    assert len(rdata.items()) == 1

    # Read an interval composed of the first 10 items.
    tenth_item = reader.items.data[9]
    rdata = reader.read(first_item, tenth_item)
    assert len(rdata.items()) == 10


#####################
# Playing with labels
#####################

# The previous examples showed writing and reading labels associated
# with 1D time information (each feature vector corresponds to a single
# timestamp, e.g. the center of a time window). In more advanced
# processing you may want to store 2D time information (e.g. the begin
# and end of a time window). For now, non-numerical labels are not
# supported.

data = generate_data(100, labeldim=2)
# Use the with statement so the file is closed before it is read back
with h5f.Writer('exemple.h5') as writer:
    writer.write(data, 'group3')

rdata = h5f.Reader('exemple.h5', 'group3').read()
assert rdata == data

# Remove the written file
from os import remove
remove('exemple.h5')