Jarosław Szmigielski's techblog

My personal adventure with IT

Music Cataloger

January 19th, 2013

I’ve finally found some time to realize my concept written here. In the meantime I’ve also managed to connect external hard drive to my Raspberry Pi. I have a lot of problems with it and tried lots of approaches but finally it’s working (I will describe it in the next post).

Now my music library is available in Raspberry Pi 🙂 It’s quite a lot of mp3s so I decided to write a quick little script in Python that will fetch basic data about file name, location, size, song duration, bitrate and data from ID3 tags. I’m also computing checksum of file to discover duplicates. All this data is later stored in MySQL database.

At first I prepared a SQL script with table creation:

  `file_name` text COLLATE utf8_unicode_ci NOT NULL,
  `file_format` text COLLATE utf8_unicode_ci NOT NULL,
  `file_location` text COLLATE utf8_unicode_ci NOT NULL,
  `file_checksum` text COLLATE utf8_unicode_ci NOT NULL,
  `file_size` INT(11) NOT NULL,
  `tag_title` text COLLATE utf8_unicode_ci NOT NULL,
  `tag_artist` text COLLATE utf8_unicode_ci NOT NULL,
  `tag_album` text COLLATE utf8_unicode_ci NOT NULL,
  `tag_year` text COLLATE utf8_unicode_ci NOT NULL,
  `tag_comment` text COLLATE utf8_unicode_ci NOT NULL,
  `tag_genre` text COLLATE utf8_unicode_ci NOT NULL,
  `tag_tracknumber` INT(11) NOT NULL,
  `song_length` INT(11) NOT NULL,
  `song_bitrate` INT(11) NOT NULL,
  `date_added` datetime NOT NULL,
  `health` INT(11) NOT NULL,
  PRIMARY KEY (`id`)

The script is quite straightforward so I will explain it briefly. To fetch ID3 tags I’ve used Mutagen library and MySQLdb for database connection. Here is the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author           Jaroslaw Szmigielski
# @description      Music Cataloger
# @created          17.01.2013
import sys
from os import listdir
from os.path import isfile, isdir, join, getsize
from mutagen.mp3 import MP3
import MySQLdb
import subprocess
import os
import hashlib
mysql_server = "localhost"
mysql_user = "user"
mysql_pass = "password"
mysql_db = "dbname"
def ConnectToDB():
        conn = MySQLdb.connect(mysql_server, mysql_user, mysql_pass, mysql_db, charset='utf8', use_unicode=True)
        return conn.cursor()
    except MySQLdb.OperationalError, err:
def md5sum(filename):
    md5 = hashlib.md5()
    with open(filename,'rb') as f:
        for chunk in iter(lambda: f.read(128*md5.block_size), b''):
    return md5.hexdigest()
supported_extensions = ["mp3", "wav"]
startpath = '/media/'
directory_list = []
if __name__ == "__main__":
    cursor = ConnectToDB()
    while(len(directory_list) > 0):
        popped = directory_list.pop()
        directories = [ f for f in listdir(popped) if isdir(join(popped, f)) ]
        for file in directories:
            directory_list.append(join(popped, file))
        files = [ f for f in listdir(popped) if isfile(join(popped, f)) ]
        for file in files:
            if file[file.rfind(".")+1:] not in supported_extensions:
            file_location = join(popped,file)
            file_checksum = md5sum(file_location)
            #file_checksum = ""
            file_format = "mp3"
            file_name = file
            file_size = getsize(file_location)
            extension = file[file.rfind(".")+1:]
            audio = ""
            if(extension == "mp3"):
                audio = MP3(file_location)
                song_length = int(round(audio.info.length))
            song_bitrate = int(round(audio.info.bitrate))
            tag_title = ""
            tag_artist = ""
            tag_album = ""
            tag_year = ""
            tag_comment = ""
            tag_genre = ""
            tag_tracknumber = ""
            if "TIT2" in audio.keys():
                tag_title = audio["TIT2"].text[0]
            if 'TPE1' in audio.keys():
                tag_artist = audio['TPE1'].text[0]
            if 'TALB' in audio.keys():
                tag_album = audio['TALB'].text[0]
            if 'TDRC' in audio.keys():
                tag_year = audio["TDRC"].text[0]
            if "COMM" in audio.keys():
                tag_comment = audio["COMM"].text[0]
            if "TCON" in audio.keys():
                    tag_genre = audio["TCON"].text[0]
                except IndexError,e:
                    tag_genre = ""
            if "TRCK" in audio.keys():
                tag_tracknumber = audio["TRCK"]
                tag_tracknumber = int(tag_tracknumber)
                tag_tracknumber = -1
            file_location = file_location.decode('utf8').replace("'", u"\u0301")
            file_name = file_name.decode('utf8').replace("'", u"\u0301")
            tag_title = tag_title.replace("'", u"\u0301")
            tag_artist = tag_artist.replace("'", u"\u0301")
            tag_album = tag_album.replace("'", u"\u0301")
            tag_comment = tag_comment.replace("'", u"\u0301")
            tag_genre = tag_genre.replace("'", u"\u0301")
            execute = u"insert into music (file_name, file_format, file_location, file_checksum, file_size, tag_title, tag_artist, tag_album, tag_year, tag_comment, tag_genre, tag_tracknumber, song_length, song_bitrate, date_added, health) values "
            execute += u"('%s', '%s', '%s', '%s', %d, '%s', '%s', '%s', '%s' ,'%s', '%s', %d, %d, %d, NOW(), 0)" % (file_name, file_format, file_location, file_checksum, int(file_size), tag_title, tag_artist, tag_album, tag_year, tag_comment, tag_genre, int(tag_tracknumber), int(song_length), int(song_bitrate))

I needed to use utf-8 encoding because I have music in many different languages and the ID3 tags there vary a lot. Script uses simplified version of DFS algorithm to go through directories and it will visit all the locations in given path. I’m making replaces at the end to avoid >'< sign, that prevents from breaking SQL insert query. Sorry for the quality of source code, it had been written for about 20 min 😉

With this kind of data in your database you can easily get very interesting information about your library, for example amount of files, total time of songs or disk space that they use. To discover if there are any duplicates in your library use this SQL query:

  FROM music c
  JOIN (SELECT c.file_checksum
          FROM music c
      GROUP BY c.file_checksum
        HAVING COUNT(*) &gt; 1) x ON x.file_checksum = c.file_checksum

I’ve also set up mpd and Icecast server on Raspberry Pi to listen to my music collection wherever I am (together with MPDroid on Android device). I will describe it in later post.

I hope you’ve found this post useful 🙂 Try it on your own music library!