Tagging files and folders using hashtags and symlinks

There are lots of tools out there that let you organise files (especially your picture archive). However, they all depend on some sort of database and on one master computer to add the tags from, and you can't browse the organised files in their organised structure from all your devices.

I made this project because I had this exact problem organising my own pictures. I wanted something which:

  • Lets you tag pictures as close as possible to where you look at them (i.e. the file browser itself).
  • Is platform independent.
    • Like, really platform independent! I wanted to browse these tags on my TV!
  • Is not another thing to back up. I am already backing up the pictures themselves.
  • Supports organising whole folders, not just single files.

There are probably many others who are annoyed by this problem, so I am sharing my solution: a Python script that goes through all the files in a directory and looks for hashtags in the filenames. It sits in the crontab on my storage NAS and runs every hour.

Example

File structure

xeor@omi { ~/Documents/my_pictures }$ find .
.
./2012 #Business trip to #USA
./2012 #Business trip to #USA/dcim0123 #People-Lars.jpg
./2012 #Business trip to #USA/dcim0124.jpg
./2012 #Business trip to #USA/dcim0125.jpg
./2012 #Business trip to #USA/dcim0126 #Conference.jpg

Running taggo

taggo run_once

Tags created

xeor@omi { ~/Documents/tags }$ find . 
.
./Business
./Business/root - 2012 #Business trip to #USA
./Conference
./Conference/2012 #Business trip to #USA - dcim0126 #Conference.jpg
./People
./People/Lars
./People/Lars/2012 #Business trip to #USA - dcim0123 #People-Lars.jpg
./USA
./USA/root - 2012 #Business trip to #USA

Explanation

As you can see in the file structure, we created one folder and four files. The folder itself, 2012 #Business trip to #USA, has two tags, #Business and #USA (as you probably already knew :) ). The dcim0123 file has the tag #People-Lars, which means that taggo should treat Lars as a sub tag of People.
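
To make the tag and sub tag idea concrete, here is a minimal sketch of how tags could be pulled out of a name. This is not taggo's actual code: the function name find_tags and the two constants are made up for the illustration, and I am assuming # as the tag indicator and - as the sub tag separator, as in the example above.

```python
import re

TAG_INDICATOR = "#"     # assumption: the tag indicator used in the example
SUBTAG_SEPARATOR = "-"  # assumption: the sub tag separator used in the example


def find_tags(name):
    """Return every tag in `name` as a list of parts.

    A plain tag gives one part, a sub tag gives one part per level.
    """
    # A tag is the indicator followed by word characters and separators.
    pattern = (re.escape(TAG_INDICATOR)
               + r"([\w" + re.escape(SUBTAG_SEPARATOR) + r"]+)")
    tags = []
    for raw in re.findall(pattern, name):
        # Split on the separator and drop empty parts (e.g. a trailing "-").
        parts = [part for part in raw.split(SUBTAG_SEPARATOR) if part]
        if parts:
            tags.append(parts)
    return tags


print(find_tags("2012 #Business trip to #USA"))
print(find_tags("dcim0123 #People-Lars.jpg"))
```

Each inner list maps directly to a folder path under the tag folder, so ['People', 'Lars'] ends up as People/Lars.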

The list of tags created is now just a bunch of symlinks to the original files. ./USA/root - 2012 #Business trip to #USA is a link to the folder called 2012 #Business trip to #USA, and the same goes for ./Business/root - 2012 #Business trip to #USA. For our sub tag, you can see that the link ends up in the directory People/Lars: ./People/Lars/2012 #Business trip to #USA - dcim0123 #People-Lars.jpg.
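
Creating one of these links takes very little code. The sketch below is only an illustration under my own assumptions, not a copy of what taggo does: link_into_tag is a hypothetical helper, and I chose to point the symlink at the absolute path of the original.

```python
import os


def link_into_tag(tag_root, tag_parts, link_name, target):
    """Create tag_root/<tag parts>/link_name as a symlink to target."""
    # ['People', 'Lars'] becomes the directory tag_root/People/Lars.
    tag_dir = os.path.join(tag_root, *tag_parts)
    if not os.path.isdir(tag_dir):
        os.makedirs(tag_dir)

    link_path = os.path.join(tag_dir, link_name)
    # lexists is also True for a dangling symlink, so we never overwrite.
    if not os.path.lexists(link_path):
        os.symlink(os.path.abspath(target), link_path)
    return link_path
```

Calling it once per tag found on a file or folder is enough to build a tree like the one shown above.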

Configuration

In the file called taggo.cfg you can define stuff like the tag indicator (the hashtag), the sub tag separator, the name the symlinks should get (default is %(rel_folders)s - %(basename)s), what to replace / with in tag filenames, the content folder and the tag folder.

Taggo will automatically create the taggo.cfg file when you run it the first time. (Just do a ./taggo)
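
The default symlink name mentioned above is an ordinary Python format string, so it can be filled in from a dict. Here is a rough sketch of how that could work. The function symlink_name is hypothetical, and so is the "_" used to replace /, since I have not listed the real default here. The "root" value for items at the top of the content folder is taken from the example output.

```python
import os

NAME_TEMPLATE = "%(rel_folders)s - %(basename)s"  # the default pattern
SLASH_REPLACEMENT = "_"  # hypothetical value for what replaces "/"


def symlink_name(content_root, path):
    """Build the symlink filename for `path` inside `content_root`."""
    rel = os.path.relpath(path, content_root)
    rel_folders, basename = os.path.split(rel)
    # Items directly in the content folder have no parent folders,
    # which the example output shows as "root".
    rel_folders = rel_folders.replace(os.sep, SLASH_REPLACEMENT) or "root"
    return NAME_TEMPLATE % {"rel_folders": rel_folders, "basename": basename}


print(symlink_name("/pics", "/pics/2012 #Business trip to #USA"))
print(symlink_name("/pics", "/pics/2012 #Business trip to #USA/dcim0126 #Conference.jpg"))
```

Replacing the / matters because the parent folders become part of a single filename, and a filename cannot contain a path separator.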

Usage

Using taggo is simple: just put it in any directory and put something like 22 * * * * /usr/bin/python /path/to/taggo run_once in the crontab. It will make sure that new symlinks are created.

If you rename a file, its symlink will die. But when you use the run_once parameter, taggo automatically deletes the invalid symlinks. I have been very careful when writing the delete function: it only deletes symlinks whose target paths do not exist. To delete the empty directories it uses os.rmdir, a Python function that only removes empty directories.
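
The cleanup idea can be sketched in a few lines. This is a simplified illustration and not the delete function from taggo itself; clean_tag_folder is a name I made up for it.

```python
import os


def clean_tag_folder(tag_root):
    """Remove dangling symlinks, then the directories that became empty."""
    # Walk bottom-up so sub folders are handled before their parents.
    for dirpath, dirnames, filenames in os.walk(tag_root, topdown=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink looks at the link itself, exists follows it.
            # A link whose target is gone is a link that does not "exist".
            if os.path.islink(path) and not os.path.exists(path):
                os.unlink(path)

        if dirpath != tag_root:
            try:
                os.rmdir(dirpath)  # refuses to remove non-empty directories
            except OSError:
                pass  # still has content, leave it alone
```

Nothing here ever touches a regular file: os.unlink is only called on paths that are symlinks, and os.rmdir fails on anything that is not an empty directory.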

To find and use the project, check out the GitHub link at the top of this article.


Python threading example, creating Pinger.py

Update 18. Nov 2012: Cleaned up some comments about cores. To make it clear, this will only run on 1 core!

Threading in Python can be confusing in the beginning. Many examples out there are overly complicated, so here is another example that I have tried to keep simple.

Here, I want a fast way to ping every host/IP in a list: as fast as we can, threaded, and at the end return a dict with two items, a list of dead nodes and a list of nodes that answer ping.

Example:

In [1]: from pinger import Pinger
In [2]: ping = Pinger()
In [3]: ping.thread_count = 8
In [4]: ping.hosts = ['10.0.0.1', '10.0.0.255', '10.0.0.100', 'google.com', 'nonexisting', '*not able to ping!*', '8.8.8.8']
In [5]: ping.start()
Out[5]: 
{'alive': ['10.0.0.255', '10.0.0.1', 'google.com', '8.8.8.8'],
 'dead': ['*not able to ping!*', 'nonexisting', '10.0.0.100']}

The example above pings up to 8 hosts at a time and collects the results until the end. We set thread_count to 8 in this example, which means that Python will have up to 8 ping commands running at the same time.
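
Why does this help when everything runs on one core? Each thread spends almost all of its time waiting for the ping command to return, and waiting does not need the CPU. The sketch below is not part of Pinger.py. It swaps ping for time.sleep (fake_ping and run_threaded are names made up for this illustration) to show that eight waits of 0.2 seconds overlap instead of adding up.

```python
import threading
import time


def fake_ping(delay):
    # Stands in for waiting on the external ping command.
    time.sleep(delay)


def run_threaded(count, delay):
    """Run `count` fake pings in parallel threads, return elapsed seconds."""
    threads = [threading.Thread(target=fake_ping, args=(delay,))
               for _ in range(count)]
    started = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks until that thread is done
    return time.time() - started


elapsed = run_threaded(8, 0.2)
print("8 fake pings took %.2f seconds" % elapsed)
```

Run one after another, the same eight waits would take about 1.6 seconds.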

The whole source of the Pinger class looks like this, read the comments and you will see how it works:

#!/usr/bin/env python

import subprocess
import threading

class Pinger(object):
    def __init__(self):
        # Per-instance state. Mutable class attributes would be shared
        # between every Pinger instance.
        self.status = {'alive': [], 'dead': []} # Populated while we are running
        self.hosts = [] # List of all hosts/ips in our input queue

        # How many ping processes to run at a time.
        self.thread_count = 4

        # Lock object that guards the spots where race conditions can occur.
        self.lock = threading.Lock()

    def ping(self, ip):
        # Use the system ping command with count of 1 and wait time of 1.
        # Open /dev/null once and close it again when the command is done.
        with open('/dev/null', 'w') as devnull:
            ret = subprocess.call(['ping', '-c', '1', '-W', '1', ip],
                                  stdout=devnull, stderr=devnull)

        return ret == 0 # Return True if our ping command succeeds

    def pop_queue(self):
        ip = None

        # Grab the lock (waiting for it if needed). The with block releases
        # it again afterwards, so another thread can grab it.
        with self.lock:
            if self.hosts:
                ip = self.hosts.pop()

        return ip

    def dequeue(self):
        while True:
            ip = self.pop_queue()

            if not ip:
                return None

            result = 'alive' if self.ping(ip) else 'dead'
            self.status[result].append(ip)

    def start(self):
        threads = []

        for i in range(self.thread_count):
            # Create self.thread_count number of threads that together will
            # cooperate removing every ip in the list. Each thread will do the
            # job as fast as it can.
            t = threading.Thread(target=self.dequeue)
            t.start()
            threads.append(t)

        # Wait until all the threads are done. .join() is blocking.
        for t in threads:
            t.join()

        return self.status

if __name__ == '__main__':
    ping = Pinger()
    ping.thread_count = 8
    ping.hosts = [
        '10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4', '10.0.0.0', '10.0.0.255', '10.0.0.100',
        'google.com', 'github.com', 'nonexisting', '127.0.1.2', '*not able to ping!*', '8.8.8.8'
        ]

    print(ping.start())

Blog technology

After going back and forth on which technology I wanted behind my blog, I decided on:

Pelican as the static blog generator

Pelican is written in Python, is very extensible through plugins, and makes it easy to create themes. It is also easy to configure and use. The main reason I went with Pelican is its simplicity and how customizable it is.

There are already other blogs out there that explain Pelican's advantages and disadvantages, and others that cover using GitHub Pages with Pelican, so that is not something I will spend time on here. But if you like to blog using plain text, Python, html/js/css customization and a powerful generator to turn it all into a blog, Pelican might be something to check out.

Multimarkdown as the writing "format"

Multimarkdown is an extension to markdown. Markdown is a structured way of writing articles, snippets, mail or even whole books. It was created as a way to write plaintext which can later be converted to html/pdf/odt/LaTeX or whatever you want, keeping the structure you want.

To be honest, everyone who sends mail on a daily basis should at least look into this. Or at least think about it. Getting mails that contain a lot of text and no structure is painful to read.

When it comes to Markdown vs. reStructuredText, I ended up with Markdown because it feels much bigger than rst. Even though rst is something the Python community uses a lot, it just feels a little dead. I even tried using rst for a long time, but it is missing some love from other people.

Github pages for hosting the generated html files

I love using GitHub for my open source projects, so it felt very natural to use their pages to store the html files for my blog. It is free, easy to publish to, and stable. I don't really have much more to say on this. But if my blog was not going to be a bunch of static files, I would probably have used Heroku.