Repository management

Repository management

Allan McRae
Hi all,

Every time I attempt to work on repo-add, I find it to be a very
difficult endeavour.  Even though it is half the size of makepkg
(without even including any of libmakepkg), it is much more convoluted
to work on.

We also have a weird repository database system.  We have:
- .db dbs with package information, signatures and delta information
- .files dbs that are the same as .db dbs but additionally include filelists

There are two reasons the .files dbs replicate all information in the
.db dbs
 - .db and .files dbs getting out of sync could cause issues
 - a complete database is useful for things like archweb, mostly to
avoid the above

I would also like to include information on source packages to these
databases.  The files information is separate due to wanting our primary
database to be small.  Likewise, source package information needs to be
separate (the signatures take most of the size in the .db dbs, so adding
source package signatures effectively doubles the size).

So two points up for discussion:


1) Sync repository layout?  I don't see any point in leaving the tar
based format, as reading of sync databases is not a bottleneck.  (The
local db format can be a bottleneck, but that is a separate discussion...)

Do we split the information in .db out of .files and add a .full db with
complete information?  Then any .src db could follow suit and just have
source package information.  How do we get around the out of sync issue
(e.g., a package is removed from .db, but we have an old .files database
with it).  Do we add timestamps, and print a warning on -F operations
when the two are out of sync?
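
The timestamp check could look roughly like the Python sketch below. It assumes each database carries one integer timestamp written when the databases are generated together; the function name, the timestamp field, and the warning text are hypothetical, and nothing like this exists in pacman today.

```python
from typing import Optional


def files_db_warning(repo: str, db_timestamp: int, files_timestamp: int) -> Optional[str]:
    """Return a warning if the .db and .files databases disagree, else None.

    Both timestamps are assumed to be stored inside the databases by the
    repo tool (a hypothetical field, not part of the current format).
    """
    if db_timestamp == files_timestamp:
        return None
    # Name whichever database is the stale one so the user knows what to refresh.
    older = "files" if files_timestamp < db_timestamp else "db"
    return (
        f"warning: {repo}.{older} is older than its counterpart; "
        "file search results may be out of date"
    )
```

A -F operation would call this once per repository and print the result if it is not None.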


2) Do we need a better (read "more easily maintainable") tool for
handling database generation and updates?  libalpm already can read in
information package files, so we could add libalpm/db_write.c with the
database creation functions.   Should we unify our repo format with our
local database format which we already write?


I am looking for ideas here.  Please brainstorm to your heart's content.

Cheers,
Allan

Re: Repository management

dave reisner
On Tue, May 09, 2017 at 10:54:44PM +1000, Allan McRae wrote:

> Hi all,
>
> Every time I attempt to work on repo-add, I find it to be a very
> difficult endeavour.  Even though it is half the size of makepkg
> (without even including any of libmakepkg), it is much more convoluted
> to work on.
>
> We also have a weird repository database system.  We have:
> - .db dbs with package information, signatures and delta information
> - .files dbs that are the same as .db dbs but additionally include filelists
>
> There are two reasons the .files dbs replicate all information in the
> .db dbs
>  - .db and .files dbs getting out of sync could cause issues
>  - a complete database is useful for things like archweb, mostly to
> avoid the above
>
> I would also like to include information on source packages to these
> databases.  The files information is separate due to wanting our primary
> database to be small.  Likewise, source package information needs to be
> separate (the signatures take most of the size in the .db dbs, so adding
> source package signatures effectively doubles the size).
>
> So two points up for discussion:
>
>
> 1) Sync repository layout?  I don't see any point in leaving the tar
> based format, as reading of sync databases is not a bottleneck.  (The
> local db format can be a bottleneck, but that is a separate discussion...)

Isn't this a historical reversal? IIRC, the sync DBs used to be expanded
onto disk, and we decided to leave them as tarballs to address
performance/fragmentation concerns.

> Do we split the information in .db out of .files and add a .full db with
> complete information?  Then any .src db could follow suit and just have
> source package information.  How do we get around the out of sync issue
> (e.g., a package is removed from .db, but we have an old .files database
> with it).  Do we add timestamps, and print a warning on -F operations
> when the two are out of sync?
>
>
> 2) Do we need a better (read "more easily maintainable") tool for
> handling database generation and updates?  libalpm already can read in
> information package files, so we could add libalpm/db_write.c with the
> database creation functions.   Should we unify our repo format with our
> local database format which we already write?
>

I'd urge you not to make this a part of pacman. It's too far off the
beaten path for most users to make it a part of an already complicated
tool.

>
> I am looking for ideas here.  Please brainstorm to your heart's content.

WRT replacing repo-add, I'd suggest we come up with the use cases we
want to support, design an interface to meet them, and then come up with
the implementation. Might be nice to start with the Arch Linux
repository layout as an example that we'd want to support (pooled
packages with symlinks into repo dirs).

> Cheers,
> Allan

Re: Repository management

Allan McRae
On 11/05/17 02:54, Dave Reisner wrote:

> On Tue, May 09, 2017 at 10:54:44PM +1000, Allan McRae wrote:
>> Hi all,
>>
>> Every time I attempt to work on repo-add, I find it to be a very
>> difficult endeavour.  Even though it is half the size of makepkg
>> (without even including any of libmakepkg), it is much more convoluted
>> to work on.
>>
>> We also have a weird repository database system.  We have:
>> - .db dbs with package information, signatures and delta information
>> - .files dbs that are the same as .db dbs but additionally include filelists
>>
>> There are two reasons the .files dbs replicate all information in the
>> .db dbs
>>  - .db and .files dbs getting out of sync could cause issues
>>  - a complete database is useful for things like archweb, mostly to
>> avoid the above
>>
>> I would also like to include information on source packages to these
>> databases.  The files information is separate due to wanting our primary
>> database to be small.  Likewise, source package information needs to be
>> separate (the signatures take most of the size in the .db dbs, so adding
>> source package signatures effectively doubles the size).
>>
>> So two points up for discussion:
>>
>>
>> 1) Sync repository layout?  I don't see any point in leaving the tar
>> based format, as reading of sync databases is not a bottleneck.  (The
>> local db format can be a bottleneck, but that is a separate discussion...)
>
> Isn't this a historical reversal? IIRC, the sync DBs used to be expanded
> onto disk, and we decided to leave them as tarballs to address
> performance/fragmentation concerns.

To be clear, I was saying to stay tar based and not to move to something
else.

>> Do we split the information in .db out of .files and add a .full db with
>> complete information?  Then any .src db could follow suit and just have
>> source package information.  How do we get around the out of sync issue
>> (e.g., a package is removed from .db, but we have an old .files database
>> with it).  Do we add timestamps, and print a warning on -F operations
>> when the two are out of sync?
>>
>>
>> 2) Do we need a better (read "more easily maintainable") tool for
>> handling database generation and updates?  libalpm already can read in
>> information package files, so we could add libalpm/db_write.c with the
>> database creation functions.   Should we unify our repo format with our
>> local database format which we already write?
>>
>
> I'd urge you not to make this a part of pacman. It's too far off the
> beaten path for most users to make it a part of an already complicated
> tool.
>

Definitely not part of pacman.  I was suggesting another program with a
libalpm backend.

>>
>> I am looking for ideas here.  Please brainstorm to your heart's content.
>
> WRT replacing repo-add, I'd suggest we come up with the use cases we
> want to support, design an interface to meet them, and then come up with
> the implementation. Might be nice to start with the Arch Linux
> repository layout as an example that we'd want to support (pooled
> packages with symlinks into repo dirs).
>
>> Cheers,
>> Allan

Re: Repository management

Andrew Gregory
In reply to this post by Allan McRae
On 05/09/17 at 10:54pm, Allan McRae wrote:

> Hi all,
>
> Every time I attempt to work on repo-add, I find it to be a very
> difficult endeavour.  Even though it is half the size of makepkg
> (without even including any of libmakepkg), it is much more convoluted
> to work on.
>
> We also have a weird repository database system.  We have:
> - .db dbs with package information, signatures and delta information
> - .files dbs that are the same as .db dbs but additionally include filelists
>
> There are two reasons the .files dbs replicate all information in the
> .db dbs
>  - .db and .files dbs getting out of sync could cause issues
>  - a complete database is useful for things like archweb, mostly to
> avoid the above
>
> I would also like to include information on source packages to these
> databases.  The files information is separate due to wanting our primary
> database to be small.  Likewise, source package information needs to be
> separate (the signatures take most of the size in the .db dbs, so adding
> source package signatures effectively doubles the size).
>
> So two points up for discussion:
>
>
> 1) Sync repository layout?  I don't see any point in leaving the tar
> based format, as reading of sync databases is not a bottleneck.  (The
> local db format can be a bottleneck, but that is a separate discussion...)
>
> Do we split the information in .db out of .files and add a .full db with
> complete information?  Then any .src db could follow suit and just have
> source package information.  How do we get around the out of sync issue
> (e.g., a package is removed from .db, but we have an old .files database
> with it).  Do we add timestamps, and print a warning on -F operations
> when the two are out of sync?
 
What about just not including the signature in the database?  Make the
inclusion of the signature optional and have pacman (or whatever
downloads the source package) also look for a corresponding .sig file
if it's not in the db.  pacman -U already looks for a .sig file when
downloading a package and you have a feature request to download .sig
files even with -S, so code-wise this seems like a pretty clean
solution. Then you can include the source information right in the
primary DB and Arch's devtools can opt to omit the signature from the
db.
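
A minimal Python sketch of the optional-signature lookup, for illustration only. It assumes the database entry has already been parsed into a dict and that an embedded signature, when present, is base64 text under a "PGPSIG" key; the function name and the dict layout are hypothetical.

```python
import base64
from pathlib import Path
from typing import Optional


def load_signature(db_entry: dict, package_path: Path) -> Optional[bytes]:
    """Return the raw signature for a package, or None if none is found.

    db_entry is a hypothetical dict of parsed database fields; "PGPSIG"
    is assumed to hold a base64 signature when the repo embeds one.
    """
    embedded = db_entry.get("PGPSIG")
    if embedded:
        return base64.b64decode(embedded)
    # No signature in the db: fall back to a detached <package>.sig file.
    sig_path = package_path.with_name(package_path.name + ".sig")
    if sig_path.is_file():
        return sig_path.read_bytes()
    return None
```

With this ordering a repo that omits signatures from the db still verifies, as long as the .sig file was downloaded next to the package.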
 
> 2) Do we need a better (read "more easily maintainable") tool for
> handling database generation and updates?  libalpm already can read in
> information package files, so we could add libalpm/db_write.c with the
> database creation functions.   Should we unify our repo format with our
> local database format which we already write?

I would love to see us drop the ini-style .PKGINFO format, if that's
what you mean.  Even without adding a database writer to libalpm,
having two formats for the exact same data is unnecessary and leads to
inconsistencies between the two.

apg

Re: Repository management

xyne
In reply to this post by Allan McRae
On 2017-05-09 22:54 +1000
Allan McRae wrote:

>I am looking for ideas here.  Please brainstorm to your heart's content.

Ok :)


>So two points up for discussion:
>
>
>1) Sync repository layout?  I don't see any point in leaving the tar
>based format, as reading of sync databases is not a bottleneck.  (The
>local db format can be a bottleneck, but that is a separate discussion...)
>
>Do we split the information in .db out of .files and add a .full db with
>complete information?  Then any .src db could follow suit and just have
>source package information.  How do we get around the out of sync issue
>(e.g., a package is removed from .db, but we have an old .files database
>with it).  Do we add timestamps, and print a warning on -F operations
>when the two are out of sync?

Add a timestamp inside each database (*.db, *.files, *.src). When pacman
downloads a database, instead of saving it as <repo>.<ext> and squashing the
previous database, save it as <repo>-<timestamp>.<ext>. Each refresh operation
(pacman -Sy, pacman -Fy) is associated with a particular database (*.db and
*.files, respectively). Create an untimestamped symlink to that database, e.g.

$ pacman -Sy...
# retrieve <repo>.db and save as <repo>-<timestamp_1>.db
# ln -s <repo>-<timestamp_1>.db <repo>.db

$ pacman -Fy...
# retrieve <repo>.db and save as <repo>-<timestamp_2>.db
# retrieve <repo>.files and save as <repo>-<timestamp_2>.files
# ln -s <repo>-<timestamp_2>.files <repo>.files

# something similar for *.src files

For operations that only involve the current <repo>.db files, no change is
needed for loading the database.

For loading <repo>.files, you will need to dereference <repo>.files first,
grab <timestamp_2> from <repo>-<timestamp_2>.files in the example above, and
then use it to load <repo>-<timestamp_2>.db instead of <repo>.db. Same method
for *.src files.
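
To make the scheme concrete, here is a rough Python sketch of the store, symlink, and lookup steps. It assumes integer timestamps and a flat sync directory; all function names are made up for illustration and this is not how pacman stores databases today.

```python
import os
from pathlib import Path


def store_database(sync_dir: Path, repo: str, ext: str, timestamp: int, data: bytes) -> Path:
    """Save <repo>-<timestamp>.<ext> without touching any symlink."""
    target = sync_dir / f"{repo}-{timestamp}.{ext}"
    target.write_bytes(data)
    return target


def point_symlink(sync_dir: Path, repo: str, ext: str, timestamp: int) -> None:
    """Repoint <repo>.<ext> at <repo>-<timestamp>.<ext> via a rename."""
    link = sync_dir / f"{repo}.{ext}"
    tmp = sync_dir / f".{repo}.{ext}.tmp"
    if tmp.is_symlink():
        tmp.unlink()  # leftover from an interrupted earlier run
    tmp.symlink_to(f"{repo}-{timestamp}.{ext}")
    os.replace(tmp, link)  # renames the symlink itself over the old one


def db_for_files(sync_dir: Path, repo: str) -> Path:
    """Find the .db that was downloaded together with the current .files."""
    target = os.readlink(sync_dir / f"{repo}.files")  # e.g. core-200.files
    stem = Path(target).name[: -len(".files")]        # e.g. core-200
    timestamp = stem.rsplit("-", 1)[1]
    return sync_dir / f"{repo}-{timestamp}.db"
```

A -Sy refresh would call store_database and point_symlink for the .db only; a -Fy refresh would store both files but repoint only <repo>.files, so db_for_files always returns the matching pair.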

For cleanup of the timestamped files, collect the valid timestamps from the
untimestamped symlinks and then remove anything that doesn't match them. This
should probably be done with each database refresh. Maybe you can use the same
function that you use to clean up the package cache with -Sc while leaving
installed packages.
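
The cleanup step could be sketched as follows in Python. It assumes integer timestamps and the <repo>-<timestamp>.<ext> naming from the example above; the function name and the regular expression are illustrative only.

```python
import os
import re
from pathlib import Path

# Matches <repo>-<timestamp>.<ext>; the integer timestamp is an assumption.
TIMESTAMPED = re.compile(r"^(?P<repo>.+)-(?P<ts>\d+)\.(?P<ext>db|files|src)$")


def clean_sync_dir(sync_dir: Path) -> list:
    """Remove timestamped databases no symlink refers to; return their names."""
    keep = set()
    for entry in sync_dir.iterdir():
        if entry.is_symlink():
            m = TIMESTAMPED.match(os.path.basename(os.readlink(entry)))
            if m:
                # Keep every file sharing this repo and timestamp, so the
                # .db paired with a live .files symlink survives too.
                keep.add((m["repo"], m["ts"]))
    removed = []
    for entry in sorted(sync_dir.iterdir()):
        m = TIMESTAMPED.match(entry.name)
        if m and not entry.is_symlink() and (m["repo"], m["ts"]) not in keep:
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Keeping by (repo, timestamp) rather than by exact filename is what preserves <repo>-<timestamp_2>.db even though only <repo>.files points at that timestamp.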

Obviously there will be some redundancy in the up to 3 copies of
<repo>-<timestamp>.db but I think that's better than e.g. breaking pkgfile
searches after an upgrade.

With this approach you could also download the latest version of the sync
databases as <repo>-<timestamp>.db without symlinking <repo>.db to it, and then
use that to query upgradable packages and other info from the mirror.

For propagating the database to the servers, nothing changes. Whenever the
database is updated, generate <repo>.db, <repo>.files, <repo>.src and whatever
else at the same time with the same internal timestamp and then just push them
out as usual.


>2) Do we need a better (read "more easily maintainable") tool for
>handling database generation and updates?  libalpm already can read in
>information package files, so we could add libalpm/db_write.c with the
>database creation functions.   Should we unify our repo format with our
>local database format which we already write?

Yes for unification, preferably in a standardized format (e.g. YAML). Having
the functionality to read and write the files in libalpm would be useful for
third-party tool developers.





On 2017-05-10 12:54 -0400
Dave Reisner wrote:

>WRT replacing repo-add, I'd suggest we come up with the use cases we
>want to support, design an interface to meet them, and then come up with
>the implementation. Might be nice to start with the Arch Linux
>repository layout as an example that we'd want to support (pooled
>packages with symlinks into repo dirs).

What about using a relative subpath instead of a filename in the database? That
would enable transparent freeform repo layouts (e.g. pooled packages without
symlinks, package groups in different subdirs, etc.).
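
As an illustration, resolving such a subpath could be as small as the Python sketch below. The function name and the directory layout in the usage note are assumptions; a bare filename, as stored today, resolves exactly as before.

```python
import posixpath


def package_location(repo_dir: str, db_location: str) -> str:
    """Resolve the path stored in the database against the repo directory.

    db_location is the (hypothetical) database field: either a bare
    filename or a path relative to the directory holding the database.
    """
    if posixpath.isabs(db_location):
        raise ValueError("database paths must be relative to the repo directory")
    return posixpath.normpath(posixpath.join(repo_dir, db_location))
```

For example, with repo_dir "/srv/ftp/core/os/x86_64", the entry "../../../pool/packages/foo-1.0-1-x86_64.pkg.tar.xz" resolves into a shared pool without any symlink in the repo directory. A real implementation would also need to decide how far outside the repo root such paths may point.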

You could also avoid the need for subdirectories by adding the architecture
to the database filename, e.g. <repo>.<arch>.<ext>



To simplify repo-add, you could include .SRCINFO directly to avoid parsing and
reformatting/rewriting that metadata. Keep it as a separate file then add a new
one (call it PKGINFO?) for information about the *.pkg.* file itself (build
date, packager, signature, checksum, size, relative filepath, etc.). Add other
files to contain related information (e.g. INSTALLINFO with install time, file
list, install origin?). That way, each step copies existing files and adds a
new one with the new info (repo-add: collect SRCINFO, add PKGINFO; install a
package: copy SRCINFO and PKGINFO to local db, create INSTALLINFO, etc.).

A repo metadata file would also be required in the root directory with the repo
timestamp for the timestamped databases described above. The file could also
collect other metadata such as package providers and maybe replacements to
speed up some operations.


Regards,
Xyne

Re: Repository management

xyne
Xyne wrote:

>Obviously there will be some redundancy in the up to 3 copies of
><repo>-<timestamp>.db but I think that's better than e.g. breaking pkgfile
>searches after an upgrade.

Just to expand on that, the worst case scenario leads to the same level of
redundancy as we currently have with complete *.files databases, while the best
case leads to no redundancy, all the while preserving the independence of
pacman -S... and pacman -F... (and whatever else you want to add).

>With this approach you could also download the latest version of the sync
>databases as <repo>-<timestamp>.db without symlinking <repo>.db to it, and then
>use that to query upgradable packages and other info from the mirror.

To make that work with my suggestion for cleaning up old timestamped databases,
add a symlink named e.g. <repo>.future, <repo>.next or <repo>.remote. That could
be used by e.g. checkupdates or pre-emptive package downloading scripts.

There may even be cases where the cleanup is unwanted, such as for a script
that regularly downloads databases and upgradable packages to provide an
incremental upgrade path at a later date (obviously regular updates are
preferred, but this may be useful and reasonable in some rare cases).

In my previous reply, I had forgotten that pacman -Sc prompts for the database
and pkgcache cleanups independently. Forget what I said about automatic
cleanups. Offload that to pacman -Sc.

Regards,
Xyne

Unifying package information files - Was: Repository management

Allan McRae
In reply to this post by Andrew Gregory
On 11/05/17 07:54, Andrew Gregory wrote:

>> 2) Do we need a better (read "more easily maintainable") tool for
>> handling database generation and updates?  libalpm already can read in
>> information package files, so we could add libalpm/db_write.c with the
>> database creation functions.   Should we unify our repo format with our
>> local database format which we already write?
>
> I would love to see us drop the ini-style .PKGINFO format, if that's
> what you mean.  Even without adding a database writer to libalpm,
> having two formats for the exact same data is unnecessary and leads to
> inconsistencies between the two.


I was not considering .PKGINFO when I wrote that, although it is a good
point...

Currently we have the following:
https://wiki.archlinux.org/index.php/User:Allan/Pacman_DB_Format

Notice the local and sync database formats are near identical (there are
some field differences), but we use two different functions to read
them, where the main difference is which fgets variant gets used - this
is what I was talking about unifying.  We already have the ability to
write that format in libalpm given we write the local db entry, so we could
extend that as the basis of writing repo databases via a libalpm tool too.
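
For readers unfamiliar with the format, the Python sketch below shows one reader and one writer for the shared "%FIELD%" block layout (a header line, its values one per line, then a blank line). It illustrates the idea of a single code path only; it is not libalpm's implementation, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_desc(text: str) -> Dict[str, List[str]]:
    """Parse '%FIELD%' blocks into a dict of field name -> list of values."""
    fields: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current block
        elif line.startswith("%") and line.endswith("%") and len(line) > 2:
            current = line[1:-1]
            fields.setdefault(current, [])
        elif current is not None:
            fields[current].append(line)
    return fields


def write_desc(fields: Dict[str, List[str]]) -> str:
    """Serialise the same structure back to text, one block per field."""
    blocks = ["%{}%\n{}\n".format(name, "\n".join(values)) for name, values in fields.items()]
    return "\n".join(blocks)
```

Because both functions work on the same dict, the same pair could serve local entries, sync entries, and a repo-writing tool, with only the set of fields differing.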


So, expanding on this idea.  It would be great to have a single package
information reader that covered .PKGINFO files, local database files,
and sync database files.  To do this .PKGINFO files (and presumably
.BUILDINFO) would need to change to the same format as the database files.

How would we make such a transition?  Add a new file into the package
(e.g. .PACKAGE) that has the new format.  Have pacman read the new
format if available, but fall back to old format if it is not available.
 Then wait a release or two to remove support for the old format?
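
The fallback could be sketched like this in Python. It assumes the package members have already been extracted into a mapping of name to text, and uses ".PACKAGE" as the new file name from the example above; the function names are hypothetical, and the sketch does not translate field names between the two formats.

```python
from typing import Dict, List, Mapping, Tuple


def parse_pkginfo(text: str) -> Dict[str, List[str]]:
    """Parse the old 'key = value' .PKGINFO format; repeated keys accumulate."""
    fields: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")  # split at the first '=' only
        if sep:
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def parse_new_format(text: str) -> Dict[str, List[str]]:
    """Parse '%FIELD%' blocks, as used by the database files."""
    fields: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None
        elif line.startswith("%") and line.endswith("%") and len(line) > 2:
            current = line[1:-1]
            fields.setdefault(current, [])
        elif current is not None:
            fields[current].append(line)
    return fields


def read_package_metadata(members: Mapping[str, str]) -> Tuple[str, Dict[str, List[str]]]:
    """Prefer the new file when present, otherwise fall back to .PKGINFO."""
    if ".PACKAGE" in members:
        return "new", parse_new_format(members[".PACKAGE"])
    if ".PKGINFO" in members:
        return "old", parse_pkginfo(members[".PKGINFO"])
    raise ValueError("package contains no metadata file")
```

Dropping old-format support later would then mean deleting parse_pkginfo and the second branch.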

Allan