posted on Tue, Feb 16 '21 under tag: code

In part 2 of this series, we look at how varnam “learns” patterns before doing transliteration

Note: This post wasn’t published till 3rd February, 2024 because I lost interest in it. It eventually got published (at the older date) because I didn’t want to just discard them. But it hasn’t been updated or completed.

The --learn-from flag passes each file in the path to varnam_learn_from_file in learn.c. We will look at it closely in this post. But, before that I need to write about a few thoughts I have been having after discussing the previous post with other collaborators of the varnam project.

Look at this function:

static int
execute_sql(varnam *handle, sqlite3 *db, const char *sql)
{
    char *zErrMsg = 0;
    int rc;

    rc = sqlite3_exec(db, sql, NULL, 0, &zErrMsg);
    if( rc != SQLITE_OK ){
        set_last_error (handle, "Failed to execute : %s", zErrMsg);
        sqlite3_free(zErrMsg);
        return VARNAM_ERROR;
    }

    return VARNAM_SUCCESS;
}

And an example of how it is called:

rc = execute_sql (handle, v_->known_words, "create index tmp_patterns_content_word_id on patterns_content (word_id);");
if (rc != VARNAM_SUCCESS)
    return rc;

C does not have exceptions. Error handling is through return values. That’s why most functions of varnam takes an input and pointer to an output as parameters and returns an integer error/success code. When returned an error code the caller can call varnam_get_last_error to get what the error was. In python or javascript, this would be handled with exceptions. Turns out that is not very common in systems programming languages because raising exceptions and exiting a function can lead to forgetting to close file descriptors, etc that can lead to leak. Go language also promotes using return values to signify error but this is easier in Go because a function can return multiple values in Go making it ripe for code like this:

f, err := os.Open("filename.ext")
if err != nil {
    log.Fatal(err)
}
// do something with the open *File f

This is very elegant when compared to what one has to do in C. In Rust too, the idiomatic way to handle such situations is to use a Result enum. Although panic is present in Rust (and Go), it is not used regularly, like exceptions

use std::fs::File;
use std::io::ErrorKind;

fn main() {
    let f = File::open("hello.txt");

    let f = match f {
        Ok(file) => file,
        Err(error) => match error.kind() {
            ErrorKind::NotFound => match File::create("hello.txt") {
                Ok(fc) => fc,
                Err(e) => panic!("Problem creating the file: {:?}", e),
            },
            other_error => {
                panic!("Problem opening the file: {:?}", other_error)
            }
        },
    };
}

In these languages, we will be able to return the result directly and therefore avoid the need for passing in a pointer to each function to store the output.

The next point is about coupling. Varnam seems to be tightly coupled to sqlite at the moment. It is not straightforward to implement interfaces in C. Also, varnam uses many sqlite features when doing its thing.

But it might be nice to have the database/storage layer decoupled. Sqlite is not appropriate for high-volume websites. And there could be other creative solutions on database side if the code is decoupled.

About language rules. The schema files (in ruby) are how varnam encodes language rules now. I haven’t looked at them yet. But it is a very interesting problem. How do you represent the rules of languages like Malayalam in very few lines of code (to avoid issues like this). Is it even possible to capture all rules in code? It might make sense to have a set of test words (from multiple languages) that include various tricky transliterations. That will give us confidence that whatever language rules we have will cover all the difficult words.

Like what you are reading? Subscribe (by RSS, email, mastodon, or telegram)!